Abstract

The random design setting for linear regression concerns estimators based on a random sam- 
ple of covariate/response pairs. This work gives explicit bounds on the prediction error for 
the ordinary least squares estimator and the ridge regression estimator under mild assumptions 
on the covariate/response distributions. In particular, this work provides sharp results on the
"out-of-sample" prediction error, as opposed to the "in-sample" (fixed design) error. Our analysis
also explicitly reveals the effect of noise vs. modeling errors. The approach reveals a close
connection to the more traditional fixed design setting, and our methods make use of recent ad- 
vances in concentration inequalities (for vectors and matrices). We also describe an application 
of our results to fast least squares computations. 



1 Introduction 

In the random design setting for linear regression, one is given pairs $(X_1,Y_1), \dots, (X_n,Y_n)$ of
covariates and responses, sampled from a population, where each $X_i$ is a random vector and $Y_i \in \mathbb{R}$.
These pairs are hypothesized to have the linear relationship

$$Y_i = X_i^\top \beta + \epsilon_i$$

for some linear map $\beta$, where the $\epsilon_i$ are noise terms. The goal of estimation in this setting is to
find coefficients $\hat\beta$ based on these $(X_i,Y_i)$ pairs such that the expected prediction error on a new
draw $(X,Y)$ from the population, measured as $\mathbb{E}[(X^\top\hat\beta - Y)^2]$, is as small as possible.

The random design setting stands in contrast to the fixed design setting, where the covariates 
$X_1, \dots, X_n$ are fixed (non-random), with only the responses $Y_1, \dots, Y_n$ being treated as random.
Thus, the covariance structure of the design points is completely known and need not be estimated, 
making the conditions simpler for establishing finite sample guarantees and for studying techniques 
such as dimension reduction and feature selection. However, the fixed design setting does not 
directly address out-of-sample prediction, which is of primary concern in some applications. 

In this work, we show that the ordinary least squares estimator can be readily understood in 
the random design setting almost as naturally as it is in the fixed design setting. Our analysis
provides a simple decomposition that decouples the estimation of the covariance structure from
another quantity resembling the fixed design risk; it is revealed that the accuracy of the covariance
estimation has but a second-order effect once $n \geq d$, whereupon the prediction error converges at
essentially the same $d/n$ rate as in the fixed design setting. Moreover, the prediction errors of the
optimal linear predictor (which need not be the same as the Bayes predictor $x \mapsto \mathbb{E}[Y|X=x]$)
can be separated into (deterministic) approximation errors and zero-mean noise, which our analysis
can treat separately in a simple way. The decomposition allows for the straightforward application
of exponential tail inequalities to all its constituent parts, and we comment on the consequences
of natural subgaussian moment assumptions that afford sharper tail inequalities, which we also
provide in this work. Finally, because many of the tail inequalities applicable here also hold under
relaxed independence assumptions, such as martingale dependence, the sampling assumptions in
random design regression can be relaxed to these more general conditions as well.

The basic form of our analysis for ordinary least squares also generalizes to give an analysis of 
the ridge estimator, which is applicable in infinite-dimensional covariate spaces. This analysis, 
which we specialize to the case where $\beta$ perfectly models the Bayes predictor, is somewhat more
involved because establishing the accuracy of the empirical second-moment matrix is more delicate. 
Nevertheless, its core still rests upon the same (or similar) exponential tail inequalities used in the 
analysis of ordinary least squares. 



Related work. Many classical analyses of the ordinary least squares estimators in the random
design setting (e.g., in the context of non-parametric estimators) do not actually show $O(d/n)$
convergence of the mean squared error to that of the best linear predictor. Rather, the error
relative to the Bayes error is bounded by some multiple (e.g., eight) of the error of the optimal
linear predictor relative to Bayes error, plus an $O(d/n)$ term (Györfi et al., 2004):

$$\mathbb{E}[(X^\top\hat\beta_{\mathrm{ols}} - \mathbb{E}[Y|X])^2] \leq 8 \cdot \mathbb{E}[(X^\top\beta - \mathbb{E}[Y|X])^2] + O(d/n).$$

Such bounds are appropriate in non-parametric settings where the error of the optimal linear
predictor also approaches the Bayes error at an $O(d/n)$ rate. Beyond these classical results, analyses of
ordinary least squares often come with non-standard restrictions on applicability or additional
dependencies on the spectrum of the second moment matrix (see the recent work of Audibert and Catoni
(2010b) for a comprehensive survey of these results). A result of Catoni (2004, Proposition 5.9.1)
gives a bound on the excess mean squared error of the form

$$\mathbb{E}[(X^\top\hat\beta_{\mathrm{ols}} - X^\top\beta)^2] \leq O\left(\frac{d + \log(\det(\hat\Sigma)/\det(\Sigma))}{n}\right)$$

where $\Sigma = \mathbb{E}[XX^\top]$ is the second-moment matrix of $X$ and $\hat\Sigma$ is its empirical counterpart. This
bound is proved to hold as soon as every linear predictor with low empirical mean squared error
satisfies certain boundedness conditions.

This work provides ridge regression bounds explicitly in terms of the vector $\beta$ (as a sequence) and in
terms of the eigenspectrum of the second moment matrix (e.g., the sequence of eigenvalues
of $\mathbb{E}[XX^\top]$). Previous analyses of ridge regression made certain boundedness assumptions (e.g.,
Zhang, 2005; Smale and Zhou, 2007). For instance, Zhang assumes $\|X\| \leq B_X$ and
$|Y - X^\top\beta| \leq B_{\mathrm{bias}}$ almost surely, and gives the bound

$$\mathbb{E}[(X^\top\hat\beta_\lambda - X^\top\beta)^2] \leq \lambda\|\beta_\lambda - \beta\|^2 + O\left(\frac{d_{1,\lambda} \cdot (B_{\mathrm{bias}} + B_X\|\beta_\lambda - \beta\|)^2}{n}\right)$$

where $d_{1,\lambda}$ is a notion of effective dimension at scale $\lambda$ (same as that in (1)). The quantity
$\|\beta_\lambda - \beta\|$ is then bounded by assuming $\|\beta\| < \infty$. Smale and Zhou separately bound
$\mathbb{E}[(X^\top\hat\beta_\lambda - X^\top\beta_\lambda)^2]$ by $O(B_X^2 B_Y^2/(\lambda^2 n))$ under the more stringent conditions that
$|Y| \leq B_Y$ and $\|X\| \leq B_X$ almost surely; this is then used to bound
$\mathbb{E}[(X^\top\hat\beta_\lambda - X^\top\beta)^2]$ under explicit boundedness assumptions on $\beta$. Our result for ridge regression
is given explicitly in terms of $\mathbb{E}[(X^\top\beta_\lambda - X^\top\beta)^2]$ (the first term in Theorem 3), which can be
bounded even when $\|\beta\|$ is unbounded. We note that $\mathbb{E}[(X^\top\beta_\lambda - X^\top\beta)^2]$
is precisely the bias term from the standard fixed design analysis of ridge regression, and therefore
is natural to expect in a random design analysis.

Recently, Audibert and Catoni (2010a,b) derived sharp risk bounds for the ordinary least squares
estimator and the ridge estimator (in addition to specially developed PAC-Bayesian estimators) in a
random design setting under very mild assumptions. Their bounds are proved using PAC-Bayesian
techniques, which allow them to achieve exponential tail inequalities under simple moment
conditions. Their non-asymptotic bound for ordinary least squares holds with probability at least
$1 - \delta$ and requires $\delta > 1/n$. This work makes stronger assumptions in some respects, allowing
for $\delta$ to be arbitrarily small (through the use of vector and matrix tail inequalities). The analysis
of Audibert and Catoni (2010a) for the ridge estimator is established in an asymptotic sense and
bounds the excess regularized mean squared error rather than the excess mean squared error itself.
Therefore, the results are not directly comparable to those provided here.

Our results can be readily applied to the analysis of certain techniques for speeding up
over-complete least squares computations, originally studied by Drineas et al. (2010). Central to this
earlier analysis is the notion of statistical leverage, which we also use in our work. In the appendix,
we show that these computational techniques can be readily understood in the context of random
design linear regression.

Outline. The rest of the paper is organized as follows. Section 2 sets up notations and the basic
data model used in the analyses. The analysis of ordinary least squares is given in Section 3, and
the analysis of ridge regression is given in Section 4. Appendix A presents the exponential tail
inequalities used in the analyses, and Appendix B discusses the application to fast least squares
computations.



2 Preliminaries 

2.1 Notations 

The Euclidean norm of a vector $x$ is denoted by $\|x\|$. The induced spectral norm of a matrix $A$ is
denoted by $\|A\|$, i.e., $\|A\| := \sup\{\|Ax\| : \|x\| = 1\}$; its Frobenius norm is denoted by $\|A\|_F$, i.e.,
$\|A\|_F^2 = \sum_{i,j} A_{i,j}^2$. For any symmetric and positive semidefinite matrix $M$ (i.e., $M = M^\top$ and
$M \succeq 0$), let $\|\cdot\|_M$ denote the norm of a vector $x$ defined by

$$\|x\|_M := \sqrt{x^\top M x}.$$

The $j$-th eigenvalue of a symmetric matrix $A$ is denoted by $\lambda_j(A)$, where $\lambda_1(A) \geq \lambda_2(A) \geq \dots$, and
the smallest and largest eigenvalues of a symmetric matrix $A$ are denoted by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$,
respectively.



2.2 Linear regression 



Let $X$ be a random vector of covariates (features) and $Y \in \mathbb{R}$ be a response variable (label),
both sampled from some (unknown) underlying joint distribution. We are interested in linear 
predictors of the response variable from the covariates, with performance measured under a standard 
probabilistic model of the covariate/response pairs. 

In the context of linear regression, the quality of a linear prediction $X^\top w$ of $Y$ from $X$ is typically
measured by the squared error $(X^\top w - Y)^2$. The mean squared error of a linear predictor $w$ is
given by

$$L(w) := \mathbb{E}[(X^\top w - Y)^2]$$

where the expectation is taken over both $X$ and $Y$. Let

$$\Sigma := \mathbb{E}[XX^\top]$$

be the second moment matrix of $X$.

We assume that $\Sigma$ is invertible, so there is a unique minimizer of $L$ given by

$$\beta := \Sigma^{-1}\mathbb{E}[XY].$$

The excess mean squared error of $w$ over the minimum is

$$L(w) - L(\beta) = \|w - \beta\|_\Sigma^2.$$
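
The excess-error identity can be checked numerically on a small finite population. The following is only an illustrative sketch: the four-point population, the helper names (`second_moment`, `mse`, `best_linear`, `sigma_norm_sq`), and the restriction to two covariates are assumptions made for the example and do not come from the analysis itself.

```python
def second_moment(points):
    """Sigma = E[X X^T] for a uniform distribution over 2-d points (x, y)."""
    n = len(points)
    return [[sum(x[i] * x[j] for x, _ in points) / n for j in range(2)]
            for i in range(2)]

def mse(points, w):
    """L(w) = E[(X^T w - Y)^2] under the same uniform distribution."""
    return sum((x[0] * w[0] + x[1] * w[1] - y) ** 2 for x, y in points) / len(points)

def best_linear(points):
    """beta = Sigma^{-1} E[X Y], using the closed-form 2x2 inverse."""
    n = len(points)
    (a, b), (c, d) = second_moment(points)
    det = a * d - b * c  # assumed nonzero, i.e. Sigma is invertible
    exy = [sum(x[i] * y for x, y in points) / n for i in range(2)]
    return [(d * exy[0] - b * exy[1]) / det, (a * exy[1] - c * exy[0]) / det]

def sigma_norm_sq(points, v):
    """||v||_Sigma^2 = v^T Sigma v."""
    s = second_moment(points)
    return sum(v[i] * s[i][j] * v[j] for i in range(2) for j in range(2))

# Hypothetical population; the responses are deliberately not exactly linear in x.
population = [((1.0, 0.0), 1.0), ((0.0, 1.0), -1.0),
              ((1.0, 1.0), 0.5), ((2.0, 1.0), 2.0)]
beta = best_linear(population)
w = [0.3, -0.7]
excess = mse(population, w) - mse(population, beta)
gap = sigma_norm_sq(population, [w[0] - beta[0], w[1] - beta[1]])
```

Here `excess` and `gap` agree up to floating-point rounding even though the responses are not exactly linear in the covariates, which is the point of the identity: it needs only the optimality of $\beta$, not a correct linear model.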



2.3 Data model 



We are interested in estimating a vector $\beta$ of coefficients from $n$ observed random covariate/response
pairs $(X_1,Y_1), \dots, (X_n,Y_n)$. We assume these pairs are independent copies of $(X,Y)$, i.e., sampled
i.i.d. from the (unknown) distribution over $(X,Y)$. The quality of an estimator $\hat\beta$ will be judged
by its excess loss $\|\hat\beta - \beta\|_\Sigma^2$, as discussed above.

We now state conditions on the distribution of the random pair (X, Y). 



2.3.1 Response model 



The response model we consider is a relaxation of the typical Gaussian model by allowing for model
approximation error and general subgaussian noise. In particular, define the random variables

$$\eta(X) := Y - \mathbb{E}[Y|X] \quad\text{and}\quad \mathrm{bias}(X) := \mathbb{E}[Y|X] - X^\top\beta,$$

where $\eta(X)$ corresponds to the response noise, and $\mathrm{bias}(X)$ corresponds to the approximation error
of $\beta$. This gives the modeling equation

$$Y = X^\top\beta + \mathrm{bias}(X) + \eta(X).$$

Conditioned on $X$, the noise $\eta(X)$ is random, while the approximation error $\mathrm{bias}(X)$ is
deterministic.

We assume the following condition on the noise $\eta(X)$.

Condition 1 (Subgaussian noise). There exists a finite $\sigma_{\mathrm{noise}} \geq 0$ such that for all $\lambda \in \mathbb{R}$, almost
surely:

$$\mathbb{E}[\exp(\lambda\eta(X)) \mid X] \leq \exp(\lambda^2\sigma_{\mathrm{noise}}^2/2).$$

In some cases, we make a further assumption on the approximation error $\mathrm{bias}(X)$. The quantity
$B_{\mathrm{bias}}$ in the following only appears in lower order terms (or as $\log(B_{\mathrm{bias}})$) in the main bounds.

Condition 2 (Bounded approximation error). There exists a finite $B_{\mathrm{bias}} \geq 0$ such that, almost
surely:

$$\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\| \leq B_{\mathrm{bias}}\sqrt{d}.$$

It is possible to relax this condition to moment bounds, simply by using a different exponential tail
inequality in the analysis. We do not consider this relaxation for the sake of simplicity.

2.3.2 Covariate model 

We separately consider two conditions on $X$. The first requires that $X$ has subgaussian moments
in every direction after whitening (the linear transformation $x \mapsto \Sigma^{-1/2}x$).

Condition 3 (Subgaussian projections). There exists a finite $\rho_{1,\mathrm{cov}} \geq 1$ such that:

$$\mathbb{E}\left[\exp\left(\alpha^\top\Sigma^{-1/2}X\right)\right] \leq \exp\left(\rho_{1,\mathrm{cov}} \cdot \|\alpha\|^2/2\right) \quad \forall \alpha \in \mathbb{R}^d.$$

The second condition requires that the squared length of $X$ (again, after whitening) is never more
than a constant factor greater than its expectation.

Condition 4 (Bounded statistical leverage). There exists a finite $\rho_{2,\mathrm{cov}} \geq 1$ such that almost
surely:

$$\frac{\|\Sigma^{-1/2}X\|}{\sqrt{d}} = \frac{\|\Sigma^{-1/2}X\|}{\sqrt{\mathbb{E}[\|\Sigma^{-1/2}X\|^2]}} \leq \rho_{2,\mathrm{cov}}.$$

This condition can be seen as being analogous to a Bernstein-like condition (e.g., an assumed
almost-sure upper bound on a random variable and a known variance; in the above, $\rho_{2,\mathrm{cov}}$ is the
ratio of these two quantities).

3 Ordinary least squares

We now work in a finite dimensional setting where $X \in \mathbb{R}^d$. The empirical mean squared error of a
linear predictor $w$ is

$$\hat{L}(w) := \frac{1}{n}\sum_{i=1}^n (X_i^\top w - Y_i)^2.$$

Let



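$$\hat\Sigma := \sum_{i=1}^n X_i X_i^\top / n$$

be the empirical second moment matrix of $X_1, \dots, X_n$. Throughout, we denote empirical
expectations by $\hat{\mathbb{E}}[\cdot]$; so, for instance,

$$\hat{L}(w) = \hat{\mathbb{E}}[(X^\top w - Y)^2] \quad\text{and}\quad \hat\Sigma = \hat{\mathbb{E}}[XX^\top].$$

If $\hat\Sigma$ is invertible, then the unique minimizer, $\hat\beta_{\mathrm{ols}}$, is given by ordinary least squares:

$$\hat\beta_{\mathrm{ols}} := \hat\Sigma^{-1}\hat{\mathbb{E}}[XY].$$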
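
The definition of the ordinary least squares estimator can be turned into a few lines of code. The sketch below is a minimal illustration, assuming covariates are given as lists of floats; the function names `solve` and `ols` and the use of plain Gaussian elimination are choices made for this example (a numerical library would normally be used instead).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("empirical second moment matrix is not invertible")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols(xs, ys):
    """Return Sigma_hat^{-1} E_hat[X Y] for covariate rows xs and responses ys."""
    n, d = len(xs), len(xs[0])
    sigma_hat = [[sum(x[i] * x[j] for x in xs) / n for j in range(d)]
                 for i in range(d)]
    exy_hat = [sum(x[i] * y for x, y in zip(xs, ys)) / n for i in range(d)]
    return solve(sigma_hat, exy_hat)

# Noise-free responses generated from the hypothetical coefficients (2, -3).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [2.0 * x[0] - 3.0 * x[1] for x in xs]
beta_ols = ols(xs, ys)
```

With noise-free responses the estimator recovers the generating coefficients up to rounding, and a rank-deficient design triggers the `ValueError`, mirroring the invertibility requirement on $\hat\Sigma$.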



3.1 Review: the fixed design setting 

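In the fixed design setting, the $X_i$ are regarded as deterministic vectors in $\mathbb{R}^d$, so the only
randomness involved is the sampling of the $Y_i$. Here, $\Sigma_{\mathrm{fixed}} := \sum_{i=1}^n X_i X_i^\top / n = \hat\Sigma$ (a deterministic
quantity, assumed without loss of generality to be invertible), and

$$\beta_{\mathrm{fixed}} := \Sigma_{\mathrm{fixed}}^{-1}\left(\frac{1}{n}\sum_{i=1}^n X_i\,\mathbb{E}[Y_i]\right)$$

is the unique minimizer of

$$L_{\mathrm{fixed}}(w) := \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n (X_i^\top w - Y_i)^2\right].$$

Here, we are interested in the excess squared error:

$$L_{\mathrm{fixed}}(w) - L_{\mathrm{fixed}}(\beta_{\mathrm{fixed}}) = \|w - \beta_{\mathrm{fixed}}\|_{\Sigma_{\mathrm{fixed}}}^2.$$

In this case, the analysis under suitable modifications of Condition 1 is standard.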
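Proposition 1 (Fixed design). Suppose $\Sigma_{\mathrm{fixed}}$ is invertible and $X \in \mathbb{R}^d$. If $\mathrm{var}(Y_i) = \sigma^2$, then:

$$\mathbb{E}\left[\|\hat\beta_{\mathrm{ols}} - \beta_{\mathrm{fixed}}\|_{\Sigma_{\mathrm{fixed}}}^2\right] = \frac{d\sigma^2}{n}$$

(where the expectation is over the randomness in the $Y_i$'s).
Instead, suppose that there exists $\sigma_{\mathrm{noise}} \geq 0$ such that

$$\mathbb{E}\left[\exp\left(\sum_{i=1}^n \alpha_i (Y_i - \mathbb{E}[Y_i])\right)\right] \leq \exp\left(\|\alpha\|^2\sigma_{\mathrm{noise}}^2/2\right)$$

for all $(\alpha_1, \dots, \alpha_n) \in \mathbb{R}^n$. For $\delta \in (0,1)$, we have that with probability at least $1 - \delta$,

$$\|\hat\beta_{\mathrm{ols}} - \beta_{\mathrm{fixed}}\|_{\Sigma_{\mathrm{fixed}}}^2 \leq \frac{\sigma_{\mathrm{noise}}^2 \cdot \left(d + 2\sqrt{d\log(1/\delta)} + 2\log(1/\delta)\right)}{n}.$$

Proof. The claim follows immediately from the definitions of $\hat\beta_{\mathrm{ols}}$ and $\beta_{\mathrm{fixed}}$, and by Lemma 14. □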



Our results for the random design setting will be directly comparable to the bound obtained here 
for fixed design. 
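
The $d\sigma^2/n$ expectation in Proposition 1 can be verified exactly on a toy design by enumerating the noise. The sketch below is an illustration under stated assumptions: two covariates, independent noise equal to $\pm\sigma$ with equal probability (so that the variance is $\sigma^2$), and a hypothetical function name `fixed_design_excess_loss`. It uses the fact that $\hat\beta_{\mathrm{ols}} - \beta_{\mathrm{fixed}} = \Sigma_{\mathrm{fixed}}^{-1} \cdot \frac{1}{n}\sum_i X_i \epsilon_i$, where $\epsilon_i := Y_i - \mathbb{E}[Y_i]$.

```python
from itertools import product

def fixed_design_excess_loss(xs, sigma):
    """Exact E||beta_ols - beta_fixed||^2 in the Sigma_fixed norm for +/-sigma noise."""
    n = len(xs)
    s = [[sum(x[i] * x[j] for x in xs) / n for j in range(2)] for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]  # assumed nonzero
    s_inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    total = 0.0
    for signs in product((-1.0, 1.0), repeat=n):  # all 2^n equally likely noise patterns
        # g = (1/n) sum_i X_i eps_i, so that beta_ols - beta_fixed = Sigma_fixed^{-1} g
        g = [sum(x[i] * sg * sigma for x, sg in zip(xs, signs)) / n for i in range(2)]
        v = [s_inv[i][0] * g[0] + s_inv[i][1] * g[1] for i in range(2)]
        total += g[0] * v[0] + g[1] * v[1]  # equals v^T Sigma_fixed v
    return total / 2 ** n

design = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
loss = fixed_design_excess_loss(design, sigma=0.5)
```

For this design $d = 2$, $n = 4$ and $\sigma^2 = 0.25$, so the enumerated value matches $d\sigma^2/n = 0.125$ up to rounding, whatever the geometry of the design points.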

Remark 1 (Approximation error in the fixed design setting) . Note that modeling error has no effect 
on the bounds above. That is, there is no dependence on the modeling error with regards to the 
excess loss in the fixed design setting. 



3.2 Out-of-sample prediction error: correct model 

Our main results are largely consequences of the decompositions in Lemma 1 and Lemma 2
combined with probability tail inequalities given in Appendix A.

First, we present the results for the case where bias(X) = 0, i.e., when the linear model is correct. 

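Lemma 1 (Random design decomposition; correct model). Suppose $\hat\Sigma \succ 0$ and $\mathbb{E}[Y|X] = X^\top\beta$,
then

$$\|\hat\beta_{\mathrm{ols}} - \beta\|_\Sigma^2 \leq \|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\| \cdot \left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1/2}X\eta(X)\right]\right\|^2.$$

Proof. Since $Y - X^\top\beta = \eta(X)$, we have

$$\hat{\mathbb{E}}[XY] - \hat\Sigma\beta = \hat{\mathbb{E}}[X\eta(X)].$$

Hence, using the definition of $\hat\beta_{\mathrm{ols}}$,

$$\hat\beta_{\mathrm{ols}} - \beta = \hat\Sigma^{-1}\hat{\mathbb{E}}[X\eta(X)].$$

Furthermore,

$$\Sigma^{1/2}(\hat\beta_{\mathrm{ols}} - \beta) = \Sigma^{1/2}\hat\Sigma^{-1/2}\,\hat{\mathbb{E}}\left[\hat\Sigma^{-1/2}X\eta(X)\right].$$

Observe $\|\Sigma^{1/2}\hat\Sigma^{-1/2}\| = \|\hat\Sigma^{-1/2}\Sigma^{1/2}\| = \|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\|^{1/2}$, and the conclusion follows. □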



This decomposition shows that as long as $\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\| = O(1)$, then the rate at which
$\|\hat\beta_{\mathrm{ols}} - \beta\|_\Sigma^2$ tends to zero is controlled by $\|\hat{\mathbb{E}}[\hat\Sigma^{-1/2}X\eta(X)]\|^2$, which is essentially the fixed
design excess loss.

To state our main bound, we first define the following quantities for all $\delta \in (0,1)$:
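$$n_{1,\delta} := 70\rho_{1,\mathrm{cov}}^2\,(d\log 41 + \log(2/\delta))$$

$$n_{2,\delta} := 4\rho_{2,\mathrm{cov}}^2\,d\log(d/\delta)$$

$$K_{1,\delta,n} := \frac{1}{1 - \frac{10\rho_{1,\mathrm{cov}}}{9}\left(\sqrt{\frac{32(d\log 41 + \log(2/\delta))}{n}} + \frac{2(d\log 41 + \log(2/\delta))}{n}\right)}$$

$$K_{2,\delta,n} := \frac{1}{1 - \sqrt{\frac{2\rho_{2,\mathrm{cov}}^2\,d\log(d/\delta)}{n}}}$$

Note that $1 < K_{1,\delta,n} < \infty$ and $1 < K_{2,\delta,n} < \infty$, respectively, when $n > n_{1,\delta}$ and $n > n_{2,\delta}$.
Furthermore,

$$\lim_{n\to\infty} K_{1,\delta,n} = 1, \qquad \lim_{n\to\infty} K_{2,\delta,n} = 1.$$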
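
To get a feel for the magnitudes involved, the threshold $n_{1,\delta}$ and the factor $K_{1,\delta,n}$ can be evaluated numerically. The sketch below assumes the forms $n_{1,\delta} = 70\rho_{1,\mathrm{cov}}^2(d\log 41 + \log(2/\delta))$ and $K_{1,\delta,n} = 1/(1 - \tfrac{10\rho_{1,\mathrm{cov}}}{9}(\sqrt{32t} + 2t))$ with $t = (d\log 41 + \log(2/\delta))/n$, which is how the definitions above and the event in the proof of Lemma 3 read; the function names `n1` and `k1` and the parameter values are hypothetical.

```python
import math

def n1(d, delta, rho1):
    """Sample-size threshold n_{1,delta} = 70 rho1^2 (d log 41 + log(2/delta))."""
    return 70.0 * rho1 ** 2 * (d * math.log(41.0) + math.log(2.0 / delta))

def k1(d, delta, rho1, n):
    """Leading factor K_{1,delta,n}; finite only when the subtracted term is below 1."""
    t = (d * math.log(41.0) + math.log(2.0 / delta)) / n
    shrink = (10.0 * rho1 / 9.0) * (math.sqrt(32.0 * t) + 2.0 * t)
    if shrink >= 1.0:
        raise ValueError("n is too small for the factor to be finite")
    return 1.0 / (1.0 - shrink)

# Hypothetical setting: d = 10 covariates, delta = 0.05, rho_{1,cov} = 1.
threshold = n1(10, 0.05, 1.0)
at_threshold = k1(10, 0.05, 1.0, math.ceil(threshold))
far_beyond = k1(10, 0.05, 1.0, 10 ** 9)
```

In this setting the threshold is a few thousand samples, the factor at the threshold stays below 5, and it approaches 1 as $n$ grows.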

Our first result follows. 

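Theorem 1 (Correct model). Fix any $\delta \in (0,1)$. Suppose that Conditions 1 and 3 hold and that
$\mathbb{E}[Y|X] = \beta \cdot X$. If $n > n_{1,\delta}$, then with probability at least $1 - 2\delta$, we have

• (Matrix errors)

$$\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\| \leq K_{1,\delta,n} \leq 5;$$

• (Excess loss)

$$\|\hat\beta_{\mathrm{ols}} - \beta\|_\Sigma^2 \leq K_{1,\delta,n} \cdot \frac{\sigma_{\mathrm{noise}}^2 \cdot \left(d + 2\sqrt{d\log(1/\delta)} + 2\log(1/\delta)\right)}{n}.$$

Suppose that Conditions 1 and 4 hold and that $\mathrm{bias}(X) = 0$. If $n > n_{2,\delta}$, then with probability at
least $1 - 2\delta$, we have

• (Matrix errors)

$$\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\| \leq K_{2,\delta,n} \leq 5;$$

• (Excess loss)

$$\|\hat\beta_{\mathrm{ols}} - \beta\|_\Sigma^2 \leq K_{2,\delta,n} \cdot \frac{\sigma_{\mathrm{noise}}^2 \cdot \left(d + 2\sqrt{d\log(1/\delta)} + 2\log(1/\delta)\right)}{n}.$$

Remark 2 (Accuracy of $\hat\Sigma$). Observe that $\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\| \leq 5$ is not a particularly stringent
condition on accuracy. In particular, a scaling of $\rho_{1,\mathrm{cov}} = \Theta(\sqrt{n})$ (or $\rho_{2,\mathrm{cov}} = \Theta(\sqrt{n})$) would imply
$\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\|$ is constant.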



3.3 Out-of-sample prediction error: misspecified model 



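Now we state our results in the general case where $\mathrm{bias}(X) \neq 0$ is allowed, i.e., a misspecified linear
model. Again, we begin with a basic decomposition.

Lemma 2 (Random design decomposition; misspecified model). If $\hat\Sigma \succ 0$, then

$$\|\hat\beta_{\mathrm{ols}} - \beta\|_\Sigma = \left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X(\mathrm{bias}(X) + \eta(X))\right]\right\|_\Sigma \leq \left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\,\mathrm{bias}(X)\right]\right\|_\Sigma + \left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\eta(X)\right]\right\|_\Sigma,$$

$$\|\hat\beta_{\mathrm{ols}} - \beta\|_\Sigma^2 \leq 2\left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\,\mathrm{bias}(X)\right]\right\|_\Sigma^2 + 2\left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\eta(X)\right]\right\|_\Sigma^2.$$

Furthermore,

$$\left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\,\mathrm{bias}(X)\right]\right\|_\Sigma^2 \leq \|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\|^2 \cdot \left\|\hat{\mathbb{E}}\left[\Sigma^{-1/2}X\,\mathrm{bias}(X)\right]\right\|^2$$

and

$$\left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\eta(X)\right]\right\|_\Sigma^2 \leq \|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\| \cdot \left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1/2}X\eta(X)\right]\right\|^2.$$

Proof. Since $Y - X^\top\beta = \mathrm{bias}(X) + \eta(X)$, we have

$$\hat{\mathbb{E}}[XY] - \hat\Sigma\beta = \hat{\mathbb{E}}[X(\mathrm{bias}(X) + \eta(X))].$$

Using the definition of $\hat\beta_{\mathrm{ols}}$, multiplying both sides on the left by $\Sigma^{1/2}\hat\Sigma^{-1}$ (which exists given the
assumption $\hat\Sigma \succ 0$) gives

$$\begin{aligned}
\Sigma^{1/2}(\hat\beta_{\mathrm{ols}} - \beta) &= \Sigma^{1/2}\hat\Sigma^{-1}\,\hat{\mathbb{E}}\left[X(\mathrm{bias}(X) + \eta(X))\right] \\
&= \Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\,\hat{\mathbb{E}}\left[\Sigma^{-1/2}X\,\mathrm{bias}(X)\right] + \Sigma^{1/2}\hat\Sigma^{-1/2}\,\hat{\mathbb{E}}\left[\hat\Sigma^{-1/2}X\eta(X)\right].
\end{aligned}$$

The claims now follow. □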

Our main result for ordinary least squares, with approximation error, follows. 

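Theorem 2 (Misspecified model). Fix any $\delta \in (0,1)$. Suppose that Conditions 1, 2, and 3 hold.
If $n > n_{1,\delta}$, then with probability at least $1 - 3\delta$, the following holds:

• (Matrix errors)

$$\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\| \leq K_{1,\delta,n} \leq 5$$

• (Approximation error contribution)

$$\left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\,\mathrm{bias}(X)\right]\right\|_\Sigma^2 \leq K_{1,\delta,n}^2\left(\frac{4\,\mathbb{E}[\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\|^2]\,(1 + 8\log(1/\delta))}{n} + \frac{3B_{\mathrm{bias}}^2\,d\log^2(1/\delta)}{n^2}\right)$$

(See Remark 4 below for interpretation).

• (Noise contribution)

$$\left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\eta(X)\right]\right\|_\Sigma^2 \leq K_{1,\delta,n} \cdot \frac{\sigma_{\mathrm{noise}}^2 \cdot \left(d + 2\sqrt{d\log(1/\delta)} + 2\log(1/\delta)\right)}{n}$$

• (Excess loss)

$$\|\hat\beta_{\mathrm{ols}} - \beta\|_\Sigma \leq \underbrace{\left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\,\mathrm{bias}(X)\right]\right\|_\Sigma}_{\sqrt{\text{approximation error contribution}}} + \underbrace{\left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\eta(X)\right]\right\|_\Sigma}_{\sqrt{\text{noise contribution}}},$$

$$\|\hat\beta_{\mathrm{ols}} - \beta\|_\Sigma^2 \leq 2\left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\,\mathrm{bias}(X)\right]\right\|_\Sigma^2 + 2\left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\eta(X)\right]\right\|_\Sigma^2.$$

Instead, if Conditions 1, 2, and 4 hold, then the above claims hold with $n_{2,\delta}$ and $K_{2,\delta,n}$ in place of
$n_{1,\delta}$ and $K_{1,\delta,n}$.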



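Remark 3. Since $\beta = \arg\min_w \mathbb{E}[(X^\top w - Y)^2]$, the excess loss bound on $\|\hat\beta_{\mathrm{ols}} - \beta\|_\Sigma^2$ can be translated
into an oracle inequality with the following identity:

$$\mathbb{E}[(X^\top\hat\beta_{\mathrm{ols}} - Y)^2] = \inf_w \mathbb{E}[(X^\top w - Y)^2] + \|\hat\beta_{\mathrm{ols}} - \beta\|_\Sigma^2.$$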

Remark 4 (Approximation error interpretation). Under Condition 4, the term which governs the
approximation error, the quantity $\mathbb{E}[\|\Sigma^{-1/2}X\|^2\,\mathrm{bias}(X)^2]$, is bounded as

$$\mathbb{E}\left[\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\|^2\right] \leq \rho_{2,\mathrm{cov}}^2 \cdot d \cdot \mathbb{E}\left[\mathrm{bias}(X)^2\right].$$

A similar bound can be obtained under Conditions 2 and 3; see Lemma 7.

Remark 5 (Comparison to fixed design). The bounds in Theorems 1 and 2 reveal the relative effect
of approximation error $\mathbb{E}[\mathrm{bias}(X)^2]$ and stochastic noise (through $\sigma_{\mathrm{noise}}^2$). The main leading factors,
$K_{1,\delta,n}$ and $K_{2,\delta,n}$, quickly approach 1 after $n > n_{1,\delta}$ and $n > n_{2,\delta}$, respectively. If we disregard
$K_{1,\delta,n}$ and $K_{2,\delta,n}$, then the bounds from Theorem 1 essentially match those in the usual fixed
design and Gaussian noise setting (where the conditional response $Y|X$ is assumed to have a normal
$\mathcal{N}(X^\top\beta, \sigma_{\mathrm{noise}}^2)$ distribution); see Proposition 1 for comparison.

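Remark 6 (The $\sigma_{\mathrm{noise}} = 0$ case and a tight upper bound). If $\sigma_{\mathrm{noise}} = 0$ (no stochastic noise), then
the excess loss is entirely due to approximation error. In this case,

$$\|\hat\beta_{\mathrm{ols}} - \beta\|_\Sigma^2 = \left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\,\mathrm{bias}(X)\right]\right\|_\Sigma^2 = \left\|\hat{\mathbb{E}}\left[\Sigma^{1/2}\hat\Sigma^{-1}X\,\mathrm{bias}(X)\right]\right\|^2.$$

Furthermore, Theorem 2 bounds this as:

$$\left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1}X\,\mathrm{bias}(X)\right]\right\|_\Sigma^2 \leq K_{1,\delta,n}^2\left(\frac{4\,\mathbb{E}[\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\|^2]\,(1 + 8\log(1/\delta))}{n} + \frac{3B_{\mathrm{bias}}^2\,d\log^2(1/\delta)}{n^2}\right).$$

Note that $\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2} \approx I$ for large enough $n$. In particular, with probability greater than $1 - \delta$,
if $n > c\,n_{1,\delta}$ where $c$ is a constant (or $n > c\,n_{2,\delta}$), we have that:

$$\frac{1}{2}\left\|\hat{\mathbb{E}}\left[\Sigma^{-1/2}X\,\mathrm{bias}(X)\right]\right\|^2 \leq \left\|\hat{\mathbb{E}}\left[\Sigma^{1/2}\hat\Sigma^{-1}X\,\mathrm{bias}(X)\right]\right\|^2 \leq 2\left\|\hat{\mathbb{E}}\left[\Sigma^{-1/2}X\,\mathrm{bias}(X)\right]\right\|^2$$

(which follows from the arguments provided in Lemmas 3 and 4). Furthermore, observe that
$\mathbb{E}[\|\hat{\mathbb{E}}[\Sigma^{-1/2}X\,\mathrm{bias}(X)]\|^2] = (1/n)\,\mathbb{E}[\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\|^2]$ (where the outside expectation is with
respect to the sample $X_1, \dots, X_n$). Hence, the bound given for the approximation error contribution
is essentially tight, up to constant factors and lower order terms, for constant $\delta$.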

3.4 Analysis of ordinary least squares 

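We separately control $\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\|$ under Condition 3 and Condition 4.

Lemma 3. For all $\delta \in (0,1)$, if Condition 3 holds and $n > n_{1,\delta}$, then

$$\Pr\left[\hat\Sigma \succ 0 \ \wedge\ \|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\| \leq K_{1,\delta,n}\right] \geq 1 - \delta$$

and that $K_{1,\delta,n} \leq 5$.

Proof. Let $\tilde{X}_i := \Sigma^{-1/2}X_i$ for $i = 1, \dots, n$, and $\tilde\Sigma := (1/n)\sum_{i=1}^n \tilde{X}_i\tilde{X}_i^\top$. Let $E$ be the event that

$$\lambda_{\min}(\tilde\Sigma) \geq 1 - \frac{\rho_{1,\mathrm{cov}}}{1 - 2(0.05)}\left(\sqrt{\frac{32\left(d\log(1 + 2/0.05) + \log(2/\delta)\right)}{n}} + \frac{2\left(d\log(1 + 2/0.05) + \log(2/\delta)\right)}{n}\right).$$

By Lemma 16 (with $\eta = 0.05$), $\Pr[E] \geq 1 - \delta$. Now assume the event $E$ holds. The lower
bound on $n$ ensures that $\lambda_{\min}(\tilde\Sigma) > 0$, which implies that $\hat\Sigma = \Sigma^{1/2}\tilde\Sigma\Sigma^{1/2} \succ 0$. Moreover, since
$\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2} = \tilde\Sigma^{-1}$,

$$\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\| = \|\tilde\Sigma^{-1}\| = \frac{1}{\lambda_{\min}(\tilde\Sigma)} \leq K_{1,\delta,n}. \qquad □$$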

Lemma 4. For all $\delta \in (0,1)$, if Condition 4 holds and $n > n_{2,\delta}$, then

$$\Pr\left[\hat\Sigma \succ 0 \ \wedge\ \|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}\| \leq K_{2,\delta,n}\right] \geq 1 - \delta$$

and that $K_{2,\delta,n} \leq 5$.

Proof. Analogous to the proof of Lemma 3 (using Lemma 17 in place of Lemma 16). □

Under Condition 1, we control $\|\hat{\mathbb{E}}[\hat\Sigma^{-1/2}X\eta(X)]\|^2$ using a tail inequality for certain quadratic
forms of subgaussian random vectors (Lemma 14).

Lemma 5. Suppose Condition 1 holds. Fix $\delta \in (0,1)$. Conditioned on $\hat\Sigma \succ 0$, we have that with
probability at least $1 - \delta$,

$$\left\|\hat{\mathbb{E}}\left[\hat\Sigma^{-1/2}X\eta(X)\right]\right\|^2 \leq \frac{\sigma_{\mathrm{noise}}^2\left(d + 2\sqrt{d\log(1/\delta)} + 2\log(1/\delta)\right)}{n}.$$

Proof. We condition on $X_1, \dots, X_n$, and consider the matrix $A \in \mathbb{R}^{d \times n}$ whose $i$-th column is
$(1/\sqrt{n})\hat\Sigma^{-1/2}X_i$, so $AA^\top = I$. From Condition 1 and Lemma 14, the result follows. □



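We control $\|\hat{\mathbb{E}}[\Sigma^{-1/2}X\,\mathrm{bias}(X)]\|^2$ using a tail inequality for sums of random vectors (Lemma 15).

Lemma 6. Suppose Condition 2 holds. Fix $\delta \in (0,1)$. With probability at least $1 - \delta$,

$$\left\|\hat{\mathbb{E}}\left[\Sigma^{-1/2}X\,\mathrm{bias}(X)\right]\right\|^2 \leq \frac{4\,\mathbb{E}[\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\|^2]\,(1 + 8\log(1/\delta))}{n} + \frac{3B_{\mathrm{bias}}^2\,d\log^2(1/\delta)}{n^2}.$$

Proof. The optimality of $\beta$ implies $\mathbb{E}[X_i\,\mathrm{bias}(X_i)] = \mathbb{E}[X\,\mathrm{bias}(X)] = 0$ for all $i = 1, \dots, n$. Using this
fact and the bound $\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\| \leq B_{\mathrm{bias}}\sqrt{d}$ from Condition 2, Lemma 15 implies:

$$\Pr\left[\left\|\hat{\mathbb{E}}\left[\Sigma^{-1/2}X\,\mathrm{bias}(X)\right]\right\| \leq \sqrt{\frac{\mathbb{E}[\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\|^2]}{n}}\left(1 + \sqrt{8\log(1/\delta)}\right) + \frac{4B_{\mathrm{bias}}\sqrt{d}\log(1/\delta)}{3n}\right] \geq 1 - \delta,$$

and the claim follows. □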



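The expectation $\mathbb{E}[\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\|^2]$ that appears in the previous lemma can be bounded in
terms of $\mathbb{E}[\mathrm{bias}(X)^2]$ under our conditions.

Lemma 7. If Conditions 2 and 3 hold, then for any $\lambda > 0$,

$$\mathbb{E}\left[\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\|^2\right] \leq \rho_{1,\mathrm{cov}} \cdot d \cdot \mathbb{E}[\mathrm{bias}(X)^2]\left(1 + \sqrt{\frac{\log\max\left\{\frac{B_{\mathrm{bias}}^2 d}{\lambda\rho_{1,\mathrm{cov}}\mathbb{E}[\mathrm{bias}(X)^2]}, 1\right\}}{d}} + \frac{\log\max\left\{\frac{B_{\mathrm{bias}}^2 d}{\lambda\rho_{1,\mathrm{cov}}\mathbb{E}[\mathrm{bias}(X)^2]}, 1\right\}}{d} + \frac{\lambda}{d}\right).$$

If Condition 4 holds, then

$$\mathbb{E}\left[\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\|^2\right] \leq \rho_{2,\mathrm{cov}}^2 \cdot d \cdot \mathbb{E}\left[\mathrm{bias}(X)^2\right].$$

Proof. For the first part of the claim, we assume Conditions 2 and 3 hold. Let $E_\delta$ be the event that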

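$$\|\Sigma^{-1/2}X\|^2 \leq \rho_{1,\mathrm{cov}} \cdot \left(d + \sqrt{d\log(1/\delta)} + \log(1/\delta)\right).$$

By Lemma 14, $\Pr[E_\delta] \geq 1 - \delta$. Therefore

$$\begin{aligned}
\mathbb{E}\left[\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\|^2\right] &= \mathbb{E}\left[\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\|^2 \cdot \mathbb{1}[E_\delta]\right] + \mathbb{E}\left[\|\Sigma^{-1/2}X\,\mathrm{bias}(X)\|^2 \cdot (1 - \mathbb{1}[E_\delta])\right] \\
&\leq \rho_{1,\mathrm{cov}} \cdot \left(d + \sqrt{d\log(1/\delta)} + \log(1/\delta)\right) \cdot \mathbb{E}\left[\mathrm{bias}(X)^2 \cdot \mathbb{1}[E_\delta]\right] + B_{\mathrm{bias}}^2 \cdot d \cdot \mathbb{E}\left[1 - \mathbb{1}[E_\delta]\right] \\
&\leq \rho_{1,\mathrm{cov}} \cdot \left(d + \sqrt{d\log(1/\delta)} + \log(1/\delta)\right) \cdot \mathbb{E}[\mathrm{bias}(X)^2] + B_{\mathrm{bias}}^2 \cdot d \cdot \delta.
\end{aligned}$$

Choosing $\delta := \min\{\lambda\rho_{1,\mathrm{cov}}\mathbb{E}[\mathrm{bias}(X)^2]/(B_{\mathrm{bias}}^2 d), 1\}$ completes the proof of the first part.

For the second part, note that under Condition 4 we have $\|\Sigma^{-1/2}X\|^2 \leq \rho_{2,\mathrm{cov}}^2 d$ almost surely, so
the claim follows immediately. □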

4 Ridge regression 

In infinite dimensional spaces, the ordinary least squares estimator is not applicable (note that 
our analysis hinges on the invertibility of $\hat\Sigma$). A natural alternative is the ridge estimator: instead
of minimizing the empirical mean squared error, the ridge estimator minimizes the empirical
regularized mean squared error. 

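For a fixed $\lambda > 0$, the regularized mean squared error and the empirical regularized error of a linear
predictor $w$ are defined as

$$L_\lambda(w) := \mathbb{E}[(X^\top w - Y)^2] + \lambda\|w\|^2 \quad\text{and}\quad \hat{L}_\lambda(w) := \hat{\mathbb{E}}[(X^\top w - Y)^2] + \lambda\|w\|^2.$$

The minimizer $\beta_\lambda$ of the regularized mean squared error is given by

$$\beta_\lambda := (\Sigma + \lambda I)^{-1}\mathbb{E}[XY].$$

The ridge estimator $\hat\beta_\lambda$ is the minimizer of the empirical regularized mean squared error, and is
given by

$$\hat\beta_\lambda := (\hat\Sigma + \lambda I)^{-1}\hat{\mathbb{E}}[XY].$$

It is convenient to define the $\lambda$-regularized matrices $\Sigma_\lambda$ and $\hat\Sigma_\lambda$ as

$$\Sigma_\lambda := \Sigma + \lambda I \quad\text{and}\quad \hat\Sigma_\lambda := \hat\Sigma + \lambda I$$

so that

$$\beta_\lambda = \Sigma_\lambda^{-1}\mathbb{E}[XY] \quad\text{and}\quad \hat\beta_\lambda = \hat\Sigma_\lambda^{-1}\hat{\mathbb{E}}[XY].$$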
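
The ridge estimator differs from ordinary least squares only by the added $\lambda I$ before the linear solve. The sketch below illustrates this; the helper names `solve`, `ridge` and `reg_loss`, the toy data, and the use of plain Gaussian elimination are assumptions made for the example rather than anything prescribed by the analysis.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ridge(xs, ys, lam):
    """Return (Sigma_hat + lam I)^{-1} E_hat[X Y]; lam > 0 keeps the system invertible."""
    n, d = len(xs), len(xs[0])
    sigma_lam = [[sum(x[i] * x[j] for x in xs) / n + (lam if i == j else 0.0)
                  for j in range(d)] for i in range(d)]
    exy_hat = [sum(x[i] * y for x, y in zip(xs, ys)) / n for i in range(d)]
    return solve(sigma_lam, exy_hat)

def reg_loss(xs, ys, lam, w):
    """Empirical regularized mean squared error of the linear predictor w."""
    fit = sum((sum(xi * wi for xi, wi in zip(x, w)) - y) ** 2
              for x, y in zip(xs, ys)) / len(xs)
    return fit + lam * sum(wi * wi for wi in w)

# Hypothetical data set with four observations and two covariates.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [2.0, -3.0, -1.0, 1.5]
w_small = ridge(xs, ys, 0.1)
w_large = ridge(xs, ys, 10.0)
```

Perturbing `w_small` in any coordinate increases `reg_loss`, as expected of the minimizer, and the larger regularization parameter shrinks the coefficient vector toward zero.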

Due to the random design, $\hat\beta_\lambda$ is not generally an unbiased estimator of $\beta_\lambda$; this is a critical issue
in our analysis.

Throughout this section, we assume that our representation is rich enough so that

$$\mathbb{E}[Y|X] = X^\top\beta.$$

However, we will not require that $\|\beta\|^2$ be finite. The specific conditions (in addition to Condition 1)
are given as follows.

Condition 5 (Ridge conditions).

1. $\mathbb{E}[Y|X] = X^\top\beta$ almost surely. That is, the regression function is perfectly modeled by $\beta$.

2. There exists $\rho_\lambda \geq 1$ such that almost surely,

$$\frac{\|\Sigma_\lambda^{-1/2}X\|}{\sqrt{\mathbb{E}[\|\Sigma_\lambda^{-1/2}X\|^2]}} = \frac{\|\Sigma_\lambda^{-1/2}X\|}{\sqrt{\sum_j \frac{\lambda_j(\Sigma)}{\lambda_j(\Sigma) + \lambda}}} \leq \rho_\lambda$$

where $\lambda_1(\Sigma), \lambda_2(\Sigma), \dots$ are the eigenvalues of $\Sigma$.

3. There exists $B_{\mathrm{bias}_\lambda} \geq 0$ such that the approximation error $\mathrm{bias}_\lambda(X)$ due to $\beta_\lambda$, defined as

$$\mathrm{bias}_\lambda(X) := X^\top(\beta - \beta_\lambda),$$

is bounded almost surely as

$$|\mathrm{bias}_\lambda(X)| \leq B_{\mathrm{bias}_\lambda}.$$

Remark 7. The second part is analogous to the bounded statistical leverage condition (Condition 4)
except with $\lambda$-whitening (the linear transformation $x \mapsto \Sigma_\lambda^{-1/2}x$) instead of whitening. Note that
$\sum_j \lambda_j(\Sigma)/(\lambda_j(\Sigma) + \lambda) \to d$ (the dimension of the covariate space) and $\Sigma_\lambda \to \Sigma$ as $\lambda \to 0$.

Remark 8. As with the quantity $B_{\mathrm{bias}}$ from Condition 2 in the ordinary least squares analysis, the
quantity $B_{\mathrm{bias}_\lambda}$ here only appears in lower order terms in the results.

4.1 Review: ridge regression in the fixed design setting 

Again, in the fixed design setting, $X_1, \dots, X_n$ are fixed (non-random) points, and, again, define
$\Sigma_{\mathrm{fixed}} := \sum_{i=1}^n X_i X_i^\top / n$ (a deterministic quantity). Here, $\hat\beta_\lambda$ is an unbiased estimate of the
minimizer of the true regularized loss, i.e.,

$$\beta_{\lambda,\mathrm{fixed}} := \mathbb{E}[\hat\beta_\lambda] = (\Sigma_{\mathrm{fixed}} + \lambda I)^{-1}\,\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n X_i Y_i\right]$$

where the expectation is with respect to the $Y_i$'s.
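The following bias-variance decomposition is useful:

$$\mathbb{E}\|\hat\beta_\lambda - \beta\|_{\Sigma_{\mathrm{fixed}}}^2 = \|\beta_{\lambda,\mathrm{fixed}} - \beta\|_{\Sigma_{\mathrm{fixed}}}^2 + \mathbb{E}\|\hat\beta_\lambda - \beta_{\lambda,\mathrm{fixed}}\|_{\Sigma_{\mathrm{fixed}}}^2$$

where the expectation is with respect to the randomness in the $Y_i$'s. Here, the first term represents
the bias due to regularization and the second is the variance.

The following straightforward lemma provides a bound on the risk of ridge regression.

Proposition 2. Denote the singular values of $\Sigma_{\mathrm{fixed}}$ by $\lambda_{j,\mathrm{fixed}}$ (in decreasing order) and define the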
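effective dimension as

$$d_{\lambda,\mathrm{fixed}} := \sum_j \left(\frac{\lambda_{j,\mathrm{fixed}}}{\lambda_{j,\mathrm{fixed}} + \lambda}\right)^2.$$

If $\mathrm{var}(Y_i) = \sigma^2$, then

$$\mathbb{E}\|\hat\beta_\lambda - \beta\|_{\Sigma_{\mathrm{fixed}}}^2 = \sum_j \frac{\beta_j^2\,\lambda_{j,\mathrm{fixed}}}{(1 + \lambda_{j,\mathrm{fixed}}/\lambda)^2} + \frac{\sigma^2 d_{\lambda,\mathrm{fixed}}}{n}$$

where the expectation is with respect to the $Y_i$'s.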



Remark 9 (Approximation error) . Again, note that modeling error has no effect on the fixed design 
excess loss for ridge regression. 
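
The fixed design ridge risk can be checked exactly on a toy example. The sketch below assumes the risk has the bias-plus-variance form $\sum_j \beta_j^2 \lambda_{j,\mathrm{fixed}}/(1 + \lambda_{j,\mathrm{fixed}}/\lambda)^2 + \sigma^2 d_{\lambda,\mathrm{fixed}}/n$ with $\beta_j$ the coordinates of $\beta$ in the eigenbasis of $\Sigma_{\mathrm{fixed}}$. To avoid any matrix algebra it further assumes a design whose second moment matrix is diagonal, a correct linear model, and independent noise equal to $\pm\sigma$ with equal probability; the function names are hypothetical.

```python
from itertools import product

def ridge_risk_formula(eigs, beta, lam, sigma, n):
    """Bias plus variance in the eigenbasis; the bias is written so that lam = 0 is allowed."""
    bias = sum(b * b * l * lam * lam / (l + lam) ** 2 for l, b in zip(eigs, beta))
    d_eff = sum((l / (l + lam)) ** 2 for l in eigs)
    return bias + sigma ** 2 * d_eff / n

def ridge_risk_enumerated(xs, beta, lam, sigma):
    """Exact risk by enumerating noise signs; valid only if sum_i x_i x_i^T / n is diagonal."""
    n, d = len(xs), len(beta)
    eigs = [sum(x[j] ** 2 for x in xs) / n for j in range(d)]  # diagonal of Sigma_fixed
    total = 0.0
    for signs in product((-1.0, 1.0), repeat=n):
        ys = [sum(xj * bj for xj, bj in zip(x, beta)) + s * sigma
              for x, s in zip(xs, signs)]
        exy = [sum(x[j] * y for x, y in zip(xs, ys)) / n for j in range(d)]
        est = [exy[j] / (eigs[j] + lam) for j in range(d)]  # coordinatewise ridge solve
        total += sum(eigs[j] * (est[j] - beta[j]) ** 2 for j in range(d))
    return total / 2 ** n

# Hypothetical design with Sigma_fixed = diag(2, 0.5).
design = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
beta = [1.0, -2.0]
formula = ridge_risk_formula([2.0, 0.5], beta, lam=0.5, sigma=0.3, n=4)
enumerated = ridge_risk_enumerated(design, beta, lam=0.5, sigma=0.3)
```

The two values agree up to rounding, and setting `lam=0` in `ridge_risk_formula` recovers the $d\sigma^2/n$ risk of ordinary least squares from Proposition 1.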

The results in the random design case are comparable to this bound, in certain ways. 



4.2 Out-of-sample prediction error: ridge regression 

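Due to the random design, $\hat\beta_\lambda$ may be a biased estimate of $\beta_\lambda$. For the sake of analysis, this motivates
us to consider another estimate, $\bar\beta_\lambda$, which is the conditional expectation of $\hat\beta_\lambda$ (conditioned on
$X_1, \dots, X_n$). Precisely,

$$\bar\beta_\lambda := \mathbb{E}[\hat\beta_\lambda \mid X_1, \dots, X_n] = \hat\Sigma_\lambda^{-1}\hat{\mathbb{E}}[X(\beta \cdot X)] = \hat\Sigma_\lambda^{-1}\hat\Sigma\beta$$

where the expectation is with respect to the $Y_i$'s. These definitions lead to the following natural
decomposition.

Lemma 8 (Random design ridge decomposition). Assume Condition 5 holds. We have that

$$\|\hat\beta_\lambda - \beta\|_\Sigma \leq \|\beta_\lambda - \beta\|_\Sigma + \|\bar\beta_\lambda - \beta_\lambda\|_\Sigma + \|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma,$$

$$\|\hat\beta_\lambda - \beta\|_\Sigma^2 \leq 3\left(\|\beta_\lambda - \beta\|_\Sigma^2 + \|\bar\beta_\lambda - \beta_\lambda\|_\Sigma^2 + \|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma^2\right).$$

Remark 10 (Special case: ordinary least squares ($\lambda = 0$)). Here, $\bar\beta_\lambda = \beta_\lambda = \beta$ if $\Sigma$ is invertible and
$\lambda = 0$, in which case the constant 3 can be replaced by 2 in the second inequality.

Proof. A norm obeys the triangle inequality, and $(a + b + c)^2 \leq 3(a^2 + b^2 + c^2)$. □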

Our main result for ridge regression provides a bound on each of these terms. 

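Theorem 3 (Ridge regression). Suppose that Conditions 1 and 5 hold. Let $\lambda_1(\Sigma), \lambda_2(\Sigma), \dots$
denote the eigenvalues of $\Sigma$, and define the following notions of effective dimensions:

$$d_{1,\lambda} := \sum_j \frac{\lambda_j(\Sigma)}{\lambda_j(\Sigma) + \lambda} \quad\text{and}\quad d_{2,\lambda} := \sum_j \left(\frac{\lambda_j(\Sigma)}{\lambda_j(\Sigma) + \lambda}\right)^2. \tag{1}$$

Define the $\lambda$-whitened error matrix as

$$\Delta_\lambda := \Sigma_\lambda^{-1/2}(\hat\Sigma - \Sigma)\Sigma_\lambda^{-1/2} \tag{2}$$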

and 



I _ I .1 4 Pj rf l,Alog(^l ; A/ (5 ) _|_ Pl d i,\ l °g( d i,\/ 5 ) 



Suppose that $\delta \in (0, 1/8)$, $\lambda \leq \lambda_{\max}(\Sigma)$, and

$$n \geq 16\rho_\lambda^2\,d_{1,\lambda}\log(d_{1,\lambda}/\delta)$$

(see Remark 15 below). There exists a universal constant $0 < c < 40$ (explicit constants are provided
in the lemmas) such that the following claims hold with probability at least $1 - 4\delta$:



(Matrix errors) 



and 



iiA A |, F < c (i + v^m) f^ l/2x J^-^ + ^wg^gwj) 

(First term) 



n n 



.i 



(Second term) 
\\h-Px\\l<cKi An (l + log(l/5)) 



. (1 ^ K[||^ 1/2 (Xbia SA (^) - A/3 A )|| 2 ] 



n 



n 2 



Furthermore (for interpretation),

$$\mathbb{E}[X\,\mathrm{bias}_\lambda(X) - \lambda\beta_\lambda] = 0$$

and

$$\mathbb{E}\left[\|\Sigma_\lambda^{-1/2}(X\,\mathrm{bias}_\lambda(X) - \lambda\beta_\lambda)\|^2\right] \leq \|\beta_\lambda - \beta\|_\Sigma^2\,(2\rho_\lambda^2 d_{1,\lambda} + 2)$$

(see Remark 13 below).

(Third term)



h - ftlll < K2x > s > n f oisc ( d 2 , A + ,/d 2 , A ||A AI | F 



2K 3/2 a 2 17 

+ A ^ nmse JU x + < 



n 



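(Excess loss)

$$\|\hat\beta_\lambda - \beta\|_\Sigma \leq \underbrace{\|\beta_\lambda - \beta\|_\Sigma}_{\sqrt{\text{first term}}} + \underbrace{\|\bar\beta_\lambda - \beta_\lambda\|_\Sigma}_{\sqrt{\text{second term}}} + \underbrace{\|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma}_{\sqrt{\text{third term}}},$$

$$\|\hat\beta_\lambda - \beta\|_\Sigma^2 \leq 3\left(\|\beta_\lambda - \beta\|_\Sigma^2 + \|\bar\beta_\lambda - \beta_\lambda\|_\Sigma^2 + \|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma^2\right).$$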
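Remark 11 (Overall form). For a fixed $\lambda$, the overall bound roughly has the form

$$\|\hat\beta_\lambda - \beta\|_\Sigma \leq \|\beta_\lambda - \beta\|_\Sigma \cdot \left(1 + O\left(\sqrt{\frac{\rho_\lambda^2 d_{1,\lambda}\log(1/\delta)}{n}}\right)\right) + \sqrt{\frac{\sigma_{\mathrm{noise}}^2\left(d_{2,\lambda} + 2\sqrt{d_{2,\lambda}\log(1/\delta)} + 2\log(1/\delta)\right)}{n}} + \text{lower order } o(1/\sqrt{n}) \text{ terms}$$

where

$$\|\beta_\lambda - \beta\|_\Sigma^2 = \sum_j \frac{\beta_j^2\,\lambda_j(\Sigma)}{(1 + \lambda_j(\Sigma)/\lambda)^2}.$$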



Remark 12 (Special case: ordinary least squares ($\lambda = 0$)). Theorem 1 is essentially a special case
for $\lambda = 0$ (with minor differences in constants and lower order terms). To see this, note that
$d_{2,\lambda} = d_{1,\lambda} = d$ and take $\rho_\lambda = \rho_{2,\mathrm{cov}}$ so that Condition 5 holds. It is clear that the first and second
terms from Theorem 3 are zero in the case of ordinary least squares, and the third term gives rise to
a nearly identical excess loss bound (in comparison to Theorem 1). In particular, the dependencies
on all terms which are $O(1/n)$ are identical (up to constants), and the terms which depend on
$\|\Delta_\lambda\|_F$ are lower order (relative to $1/n$).
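
To make the quantities $d_{1,\lambda}$ and $d_{2,\lambda}$ in this remark concrete, here is a minimal sketch. It assumes the definition $d_{p,\lambda} := \sum_j (\lambda_j(\Sigma)/(\lambda_j(\Sigma) + \lambda))^p$ from (1), which is not restated in this section but matches the traces used in the proofs below; the function name is hypothetical.

```python
def effective_dimension(eigvals, lam, p):
    """d_{p,lam} = sum_j (lambda_j / (lambda_j + lam))^p.

    eigvals are the (strictly positive) eigenvalues of Sigma; with lam = 0
    every summand equals 1, so the result is the ambient dimension d.
    """
    return sum((l / (l + lam)) ** p for l in eigvals)
```

For $\lambda > 0$ each ratio lies in $(0, 1)$, so $d_{2,\lambda} \leq d_{1,\lambda} \leq d$, with equality throughout at $\lambda = 0$.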

Remark 13 (Comparison to fixed design). The random design setting behaves much like the fixed
design, with the notable exception of the second term in the decomposition. This term behaves much
like modeling error (in the finite dimensional case), since $X \,\mathrm{bias}_\lambda(X) - \lambda\beta_\lambda$ is mean 0. Furthermore,
since

\[ \mathbb{E}\left[ \|\Sigma_\lambda^{-1/2}(X \,\mathrm{bias}_\lambda(X) - \lambda\beta_\lambda)\|^2 \right] \leq \|\beta_\lambda - \beta\|_\Sigma^2 \,(2\rho_\lambda^2 d_{1,\lambda} + 2), \]

this second term is a lower order term compared to the first term $\|\beta_\lambda - \beta\|_\Sigma^2$. Note that $\|\beta_\lambda - \beta\|_\Sigma^2$
is precisely the bias term from the fixed design analysis, except with the eigenvalues of $\Sigma$ in place
of the eigenvalues of the fixed design matrix.

Remark 14 (Random design effects and scaling $\lambda$). Note that the above condition allows one to see the
effects of scaling $\lambda$, such as the common setting of $\lambda = \Theta(1/\sqrt{n})$. As long as $\rho_\lambda^2 d_{1,\lambda}$ scales in a mild
way with $\lambda$, then the random design has little effect.
Remark 15 (Conditions: $\lambda \leq \lambda_{\max}(\Sigma)$ and $\delta \in (0, 1/8)$). These conditions allow for a simplified
expression for the matrix error term $\|\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}\|$ (through $\|\Delta_\lambda\|$) and are rather mild. The
proof of Lemma 10 provides the general expression, even if these conditions do not hold.



4.3 Analysis of ridge regression 

Recall the definitions of $d_{2,\lambda}$, $d_{1,\lambda}$, and $\Delta_\lambda$ from (1) and (2) in Theorem 3. First, we bound the
Frobenius and spectral norms of $\Delta_\lambda$ in terms of $d_{2,\lambda}$, $d_{1,\lambda}$, and the quantities from Condition 5.
Then, assuming $\|\Delta_\lambda\| < 1$, we proceed to bound the various terms in the decomposition from
Lemma 8 using these same quantities.

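Lemma 9 (Frobenius error concentration). Assume Condition 5 holds. With probability at least
$1 - \delta$,

\[ \|\Delta_\lambda\|_F \leq \left( 1 + \sqrt{8 \log(1/\delta)} \right) \sqrt{\frac{\mathbb{E}[\|\Sigma_\lambda^{-1/2} X\|^4] - d_{2,\lambda}}{n}} + \frac{(4/3) \left( \rho_\lambda^2 d_{1,\lambda} + \sqrt{d_{2,\lambda}} \right) \log(1/\delta)}{n}. \]

Proof. Define the $\lambda$-whitened random vectors

\[ X_{i,w} := \Sigma_\lambda^{-1/2} X_i \]

so that the random matrices

\[ M_i := X_{i,w} X_{i,w}^\top - \Sigma_w, \qquad \Sigma_w := \Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}, \]

have expectation zero. In these terms, $\Delta_\lambda = (1/n) \sum_{i=1}^n M_i$. Observe that $\|\Delta_\lambda\|_F^2$ is the inner
product

\[ \|\Delta_\lambda\|_F^2 = \langle \Delta_\lambda, \Delta_\lambda \rangle \]

where $\langle A, B \rangle := \mathrm{tr}(A B^\top)$.

We apply Lemma 15, treating the $M_i$ as random vectors with inner product $\langle \cdot, \cdot \rangle$, to bound $\|\Delta_\lambda\|_F$ with
probability at least $1 - \delta$. Note that $\mathbb{E}[M_i] = 0$ and, by Condition 5, that

\[ \mathbb{E}[\|M_i\|_F^2] = \mathbb{E}[\langle X_{i,w} X_{i,w}^\top, X_{i,w} X_{i,w}^\top \rangle] - \langle \Sigma_w, \Sigma_w \rangle = \mathbb{E}[\|X_{i,w}\|^4] - \mathrm{tr}(\Sigma_w^2) = \mathbb{E}[\|\Sigma_\lambda^{-1/2} X\|^4] - d_{2,\lambda}. \]

Also,

\[ \|X_{i,w} X_{i,w}^\top\|_F^2 = \|X_{i,w}\|^4 \leq \rho_\lambda^4 d_{1,\lambda}^2 \]

and

\[ \|\Sigma_w\|_F^2 = \langle \Sigma_w, \Sigma_w \rangle = d_{2,\lambda} \]

so that

\[ \|M_i\|_F \leq \|X_{i,w} X_{i,w}^\top\|_F + \|\Sigma_w\|_F \leq \rho_\lambda^2 d_{1,\lambda} + \sqrt{d_{2,\lambda}}. \]

Therefore, Lemma 15 (applied to $M_1, \ldots, M_n$ with $v = n(\mathbb{E}[\|\Sigma_\lambda^{-1/2} X\|^4] - d_{2,\lambda})$ and $b = \rho_\lambda^2 d_{1,\lambda} + \sqrt{d_{2,\lambda}}$, then dividing by $n$) implies the claim, so the proof is complete. □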
Lemma 10 (Spectral error concentration). Assume Condition 5 holds. Suppose that $\lambda \leq \lambda_1(\Sigma)$
and that $\delta \in (0, 1/8)$. With probability at least $1 - \delta$,

\[ \|\Delta_\lambda\| \leq \sqrt{\frac{4\rho_\lambda^2 d_{1,\lambda} \log(d_{1,\lambda}/\delta)}{n}} + \frac{2\rho_\lambda^2 d_{1,\lambda} \log(d_{1,\lambda}/\delta)}{3n}. \]

Remark 16. The condition that $\lambda \leq \lambda_1(\Sigma)$ is only needed to simplify the bound on $\|\Delta_\lambda\|$; it ensures
a lower bound on $d_{1,\lambda}$ (since $d_{1,\lambda} \to 0$ as $\lambda \to \infty$), but this can be easily removed with a somewhat
more cumbersome bound.

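Proof. Define $M_i := \Sigma_\lambda^{-1/2} (X_i X_i^\top - \Sigma) \Sigma_\lambda^{-1/2}$. Note that by Condition 5,

\[ \lambda_{\max}(M_i) \leq \|\Sigma_\lambda^{-1/2} X_i\|^2 \leq \rho_\lambda^2 d_{1,\lambda}, \]

\[ \lambda_{\max}(-M_i) \leq \lambda_{\max}(\Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}) \leq \frac{\lambda_1(\Sigma)}{\lambda_1(\Sigma) + \lambda} \leq \rho_\lambda^2 d_{1,\lambda}, \]

\[ \lambda_{\max}(\mathbb{E}[M_i^2]) \leq \lambda_{\max}(\mathbb{E}[(\Sigma_\lambda^{-1/2} X_i X_i^\top \Sigma_\lambda^{-1/2})^2]) \leq \rho_\lambda^2 d_{1,\lambda} \, \lambda_{\max}(\Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}) \leq \rho_\lambda^2 d_{1,\lambda}, \]

\[ \mathrm{tr}(\mathbb{E}[M_i^2]) \leq \mathrm{tr}(\mathbb{E}[(\Sigma_\lambda^{-1/2} X_i X_i^\top \Sigma_\lambda^{-1/2})^2]) \leq \rho_\lambda^2 d_{1,\lambda} \, \mathrm{tr}(\Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}) = \rho_\lambda^2 d_{1,\lambda} \cdot d_{1,\lambda} = \rho_\lambda^2 d_{1,\lambda}^2. \]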

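From Lemma 18, for $t \geq 2.6$,

\[ \Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^n M_i \right) > \sqrt{\frac{2\rho_\lambda^2 d_{1,\lambda} t}{n}} + \frac{\rho_\lambda^2 d_{1,\lambda} t}{3n} \right] \leq d_{1,\lambda} \cdot t (e^t - t - 1)^{-1} \leq d_{1,\lambda} \cdot e^{-t/2}. \]

The claim follows for $t = 2\log(d_{1,\lambda}/\delta)$ for $\delta \leq 1/8$. □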
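
The following sketch evaluates the Lemma 10 bound numerically. It is an illustration under stated assumptions rather than part of the proof: the function names are hypothetical, and the inequality $t(e^t - t - 1)^{-1} \leq e^{-t/2}$ for $t \geq 2.6$ is assumed to be the simplification behind the threshold on $t$. At the smallest sample size allowed by Theorem 3, $n = 16\rho_\lambda^2 d_{1,\lambda}\log(d_{1,\lambda}/\delta)$, the bound equals $1/2 + 1/24 = 13/24 < 1$, which is what the later lemmas need.

```python
import math


def spectral_error_bound(rho, d1, delta, n):
    """Right-hand side of the Lemma 10 bound on the spectral norm of Delta_lambda."""
    x = rho ** 2 * d1 * math.log(d1 / delta)
    return math.sqrt(4.0 * x / n) + 2.0 * x / (3.0 * n)


def bernstein_tail(t):
    """The factor t * (e^t - t - 1)^{-1} appearing in Lemma 18."""
    return t / (math.exp(t) - t - 1.0)


def minimal_sample_size(rho, d1, delta):
    """Smallest n permitted by the sample size condition of Theorem 3."""
    return 16.0 * rho ** 2 * d1 * math.log(d1 / delta)
```

With $t = 2\log(d_{1,\lambda}/\delta)$, the product $d_{1,\lambda} \cdot$ `bernstein_tail(t)` stays below $\delta$ in the examples tried in the test, consistent with the failure probability claimed in the lemma.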

Now we bound the (second and third) terms in the decomposition from Lemma 8.

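Lemma 11 (Second term in ridge decomposition). Assume Condition 5 holds. If $\|\Delta_\lambda\| < 1$, then

1. $\|\bar\beta_\lambda - \beta_\lambda\|_\Sigma^2 \leq \dfrac{1}{(1 - \|\Delta_\lambda\|)^2} \left\| \hat{\mathbb{E}}[\Sigma_\lambda^{-1/2}(X \,\mathrm{bias}_\lambda(X) - \lambda\beta_\lambda)] \right\|^2$;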

2. urei/t probability at least 1 — 8, 



1 



E[||i7 A - 1/2 (Xbias A (X)-A/3 A )|| 2 ] 



* " (T^PW ^ + 321 ° g(1/5)) — n 

S(pld^BL s +\\Px-(3\\l)log 2 (l/5) 



n 2 



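Furthermore,

\[ \mathbb{E}\left[ \|\Sigma_\lambda^{-1/2}(X \,\mathrm{bias}_\lambda(X) - \lambda\beta_\lambda)\|^2 \right] \leq \|\beta_\lambda - \beta\|_\Sigma^2 \,(2\rho_\lambda^2 d_{1,\lambda} + 2). \]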
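Proof. Since $\bar\beta_\lambda = \hat\Sigma_\lambda^{-1} \hat\Sigma \beta$ and $\beta_\lambda = \hat\Sigma_\lambda^{-1} \hat\Sigma_\lambda \beta_\lambda = \hat\Sigma_\lambda^{-1} (\hat\Sigma \beta_\lambda + \lambda\beta_\lambda)$, we have that

\begin{align*}
\|\bar\beta_\lambda - \beta_\lambda\|_\Sigma^2
&= \|\hat\Sigma_\lambda^{-1} (\hat\Sigma\beta - (\hat\Sigma\beta_\lambda + \lambda\beta_\lambda))\|_\Sigma^2 \\
&\leq \|\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} (\hat\Sigma\beta - (\hat\Sigma\beta_\lambda + \lambda\beta_\lambda))\|^2 \\
&\leq \|\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}\|^2 \, \left\| \hat{\mathbb{E}}[\Sigma_\lambda^{-1/2}(X \,\mathrm{bias}_\lambda(X) - \lambda\beta_\lambda)] \right\|^2.
\end{align*}

The first claim now follows from Lemma 13.

Now we prove the second claim using Lemma 15. First, note that for each $i$,

\[ \|\Sigma_\lambda^{-1/2}(X_i \,\mathrm{bias}_\lambda(X_i) - \lambda\beta_\lambda)\| \leq \|\Sigma_\lambda^{-1/2} X_i \,\mathrm{bias}_\lambda(X_i)\| + \|\lambda \Sigma_\lambda^{-1/2} \beta_\lambda\|. \]

Each term can be further bounded using Condition 5 as

\[ \|\Sigma_\lambda^{-1/2}(X_i \,\mathrm{bias}_\lambda(X_i))\| \leq \rho_\lambda \sqrt{d_{1,\lambda}} \, |\mathrm{bias}_\lambda(X_i)| \leq \rho_\lambda \sqrt{d_{1,\lambda}} \, B_{\mathrm{bias}_\lambda} \]

and

\[ \|\lambda \Sigma_\lambda^{-1/2} \beta_\lambda\|^2 = \sum_j \beta_j^2 \frac{\lambda^2 \lambda_j^2}{(\lambda + \lambda_j)^3} \leq \sum_j \beta_j^2 \frac{\lambda^2 \lambda_j}{(\lambda + \lambda_j)^2} = \|\beta_\lambda - \beta\|_\Sigma^2. \]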
By Lemma [T5l we have that with probability at least 1 — 5, 



ix-PWz- 



E[27 A 1/2 (Xbias A (X)-A/3 A )]|| < 



'E[||17r 1/2 (Xbias A (X)-A/3 A 



71 



1 + V81og(l/5) 



so 



(4/3)( /3Ax /^Ii? biaSA + ||/3 A -/3|| I ;) i 
H log (1/5) 



n 



E[Er 1/2 (XbiaSA(X)-AA)]|| 2 < 



2E[ ||r r ^(Xbia SAm -Afe)P 1 , 1 + V 3 15iW j) 



n 



+ 



(8/3)(p AV ^£ biaSA + ||/3 A - fills)' 
?? 2 



log 2 (l/5) 



< 



4E[||27; 1/2 (2Tbias A (X)-A/3 A )|| 2 ] 



(l + 81og(l/<5)) 



+ *o log (1/5). 
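Finally,

\begin{align*}
\mathbb{E}\left[ \|\Sigma_\lambda^{-1/2}(X \,\mathrm{bias}_\lambda(X) - \lambda\beta_\lambda)\|^2 \right]
&\leq 2\,\mathbb{E}\left[ \|\Sigma_\lambda^{-1/2} X \,\mathrm{bias}_\lambda(X)\|^2 \right] + 2\|\beta_\lambda - \beta\|_\Sigma^2 \\
&\leq 2\rho_\lambda^2 d_{1,\lambda} \, \mathbb{E}[\mathrm{bias}_\lambda(X)^2] + 2\|\beta_\lambda - \beta\|_\Sigma^2 \\
&= \|\beta_\lambda - \beta\|_\Sigma^2 \,(2\rho_\lambda^2 d_{1,\lambda} + 2)
\end{align*}

which proves the last claim. □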



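Lemma 12 (Third term in ridge decomposition). Assume Condition 1 holds. Let $A := [X_1 | \cdots | X_n]$
be the random matrix whose $i$-th column is $X_i$. Let

\[ M := \frac{1}{n^2} A^\top \hat\Sigma_\lambda^{-1} \Sigma \hat\Sigma_\lambda^{-1} A. \]

We have

\[ \Pr\left[ \|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma^2 \leq \sigma_{\mathrm{noise}}^2 \mathrm{tr}(M) + 2\sigma_{\mathrm{noise}}^2 \sqrt{\mathrm{tr}(M) \|M\| \log(1/\delta)} + 2\sigma_{\mathrm{noise}}^2 \|M\| \log(1/\delta) \;\middle|\; X_1, \ldots, X_n \right] \geq 1 - \delta. \]

Furthermore, if $\|\Delta_\lambda\| < 1$, then

\[ \mathrm{tr}(M) \leq \frac{1}{n} \cdot \frac{1}{(1 - \|\Delta_\lambda\|)^2} \cdot \left( d_{2,\lambda} + \sqrt{d_{2,\lambda}} \|\Delta_\lambda\|_F \right) \quad \text{and} \quad \|M\| \leq \frac{1}{n} \cdot \frac{1}{1 - \|\Delta_\lambda\|}. \]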

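Proof. Let $Z := (\eta(X_1), \ldots, \eta(X_n))$ be the random vector whose $i$-th component is $\eta(X_i)$. By
definition of $\hat\beta_\lambda$ and $\bar\beta_\lambda$,

\begin{align*}
\|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma^2
&= \|\hat\Sigma_\lambda^{-1} (\hat{\mathbb{E}}[XY] - \hat{\mathbb{E}}[X \,\mathbb{E}[Y|X]])\|_\Sigma^2 \\
&= \|\hat\Sigma_\lambda^{-1} (\hat{\mathbb{E}}[X \eta(X)])\|_\Sigma^2 \\
&= \|(1/n) \hat\Sigma_\lambda^{-1} A Z\|_\Sigma^2 \\
&= \|M^{1/2} Z\|^2.
\end{align*}

By Lemma 14, we have that with probability at least $1 - \delta$ (conditioned on $X_1, \ldots, X_n$),

\begin{align*}
\|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma^2
&\leq \sigma_{\mathrm{noise}}^2 \mathrm{tr}(M) + 2\sigma_{\mathrm{noise}}^2 \sqrt{\mathrm{tr}(M^2) \log(1/\delta)} + 2\sigma_{\mathrm{noise}}^2 \|M\| \log(1/\delta) \\
&\leq \sigma_{\mathrm{noise}}^2 \mathrm{tr}(M) + 2\sigma_{\mathrm{noise}}^2 \sqrt{\mathrm{tr}(M) \|M\| \log(1/\delta)} + 2\sigma_{\mathrm{noise}}^2 \|M\| \log(1/\delta).
\end{align*}

The second step uses the fact that $M$ is positive semi-definite and therefore

\[ \mathrm{tr}(M^2) = \sum_j \lambda_j(M)^2 \leq \sum_j \lambda_j(M) \cdot \lambda_{\max}(M) = \mathrm{tr}(M) \|M\| \]

where we use the notation $\lambda_j(H)$ to denote the $j$-th largest eigenvalue of a symmetric matrix $H$.
This gives the first claim.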
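
The relaxation from $\mathrm{tr}(M^2)$ to $\mathrm{tr}(M)\|M\|$ in the step above can be illustrated numerically. The sketch below is a hypothetical helper, not part of the paper: it takes the eigenvalues of a positive semi-definite $M$ as a list and returns both versions of the bound, so the looser one can be compared against the tighter one.

```python
import math


def noise_term_bounds(eigs, sigma_sq, delta):
    """Return (tight, relaxed) bounds for a PSD matrix M with eigenvalues eigs.

    tight   uses sqrt(tr(M^2) * log(1/delta));
    relaxed replaces tr(M^2) by tr(M) * ||M||, as in the proof of Lemma 12.
    """
    t = math.log(1.0 / delta)
    tr_m = sum(eigs)
    tr_m2 = sum(e * e for e in eigs)
    norm_m = max(eigs)  # spectral norm of a PSD matrix is its largest eigenvalue
    last = 2.0 * sigma_sq * norm_m * t
    tight = sigma_sq * tr_m + 2.0 * sigma_sq * math.sqrt(tr_m2 * t) + last
    relaxed = sigma_sq * tr_m + 2.0 * sigma_sq * math.sqrt(tr_m * norm_m * t) + last
    return tight, relaxed
```

The two bounds coincide when all nonzero eigenvalues of $M$ are equal, and the relaxed one is larger otherwise.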
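Now observe that since $(1/n) A A^\top = \hat\Sigma$,

\begin{align*}
\|M\| &= \frac{1}{n^2} \cdot \|\Sigma^{1/2} \hat\Sigma_\lambda^{-1} A\|^2 \\
&= \frac{1}{n} \cdot \|\Sigma^{1/2} \hat\Sigma_\lambda^{-1} \hat\Sigma \hat\Sigma_\lambda^{-1} \Sigma^{1/2}\| \\
&\leq \frac{1}{n} \cdot \|\Sigma^{1/2} \hat\Sigma_\lambda^{-1} \Sigma^{1/2}\| \\
&\leq \frac{1}{n} \cdot \lambda_{\max}(\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}) \\
&\leq \frac{1}{n} \cdot \frac{1}{1 - \|\Delta_\lambda\|}
\end{align*}

where the last inequality follows from the assumption $\|\Delta_\lambda\| < 1$ and Lemma 13. Moreover,

\[ \mathrm{tr}(M) = \frac{1}{n^2} \cdot \mathrm{tr}(A^\top \hat\Sigma_\lambda^{-1} \Sigma \hat\Sigma_\lambda^{-1} A) = \frac{1}{n} \cdot \mathrm{tr}(\hat\Sigma_\lambda^{-1} \Sigma \hat\Sigma_\lambda^{-1} \hat\Sigma). \]

To bound this trace expression, we first define the $\lambda$-whitened versions of $\Sigma$, $\hat\Sigma$, and $\hat\Sigma_\lambda$:

\begin{align*}
\Sigma_w &:= \Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}, \\
\hat\Sigma_w &:= \Sigma_\lambda^{-1/2} \hat\Sigma \Sigma_\lambda^{-1/2}, \\
\hat\Sigma_{\lambda,w} &:= \Sigma_\lambda^{-1/2} \hat\Sigma_\lambda \Sigma_\lambda^{-1/2}.
\end{align*}

We have the following identity:

\[ \mathrm{tr}(\hat\Sigma_\lambda^{-1} \Sigma \hat\Sigma_\lambda^{-1} \hat\Sigma) = \mathrm{tr}(\hat\Sigma_{\lambda,w}^{-1} \Sigma_w \hat\Sigma_{\lambda,w}^{-1} \hat\Sigma_w). \]

By von Neumann's theorem (Horn and Johnson, 1985, page 423),

\[ \mathrm{tr}(\hat\Sigma_{\lambda,w}^{-1} \Sigma_w \hat\Sigma_{\lambda,w}^{-1} \hat\Sigma_w) \leq \sum_j \lambda_j(\hat\Sigma_{\lambda,w}^{-1} \Sigma_w \hat\Sigma_{\lambda,w}^{-1}) \cdot \lambda_j(\hat\Sigma_w) \]

and by Ostrowski's theorem (Horn and Johnson, 1985, Theorem 4.5.9),

\[ \lambda_j(\hat\Sigma_{\lambda,w}^{-1} \Sigma_w \hat\Sigma_{\lambda,w}^{-1}) \leq \lambda_{\max}(\hat\Sigma_{\lambda,w}^{-2}) \cdot \lambda_j(\Sigma_w). \]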
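Therefore,

\begin{align*}
\mathrm{tr}(\hat\Sigma_{\lambda,w}^{-1} \Sigma_w \hat\Sigma_{\lambda,w}^{-1} \hat\Sigma_w)
&\leq \lambda_{\max}(\hat\Sigma_{\lambda,w}^{-2}) \cdot \sum_j \lambda_j(\Sigma_w) \cdot \lambda_j(\hat\Sigma_w) \\
&\leq \frac{1}{(1 - \|\Delta_\lambda\|)^2} \cdot \sum_j \lambda_j(\Sigma_w) \cdot \lambda_j(\hat\Sigma_w) \\
&= \frac{1}{(1 - \|\Delta_\lambda\|)^2} \cdot \left( \sum_j \lambda_j(\Sigma_w)^2 + \sum_j \lambda_j(\Sigma_w) (\lambda_j(\hat\Sigma_w) - \lambda_j(\Sigma_w)) \right) \\
&\leq \frac{1}{(1 - \|\Delta_\lambda\|)^2} \cdot \left( \sum_j \lambda_j(\Sigma_w)^2 + \sqrt{\sum_j \lambda_j(\Sigma_w)^2} \sqrt{\sum_j (\lambda_j(\hat\Sigma_w) - \lambda_j(\Sigma_w))^2} \right)
\end{align*}

where the second inequality follows from Lemma 13, and the third inequality follows from Cauchy-Schwarz. Since

\[ \sum_j \lambda_j(\Sigma_w)^2 = d_{2,\lambda} \]

and, by Mirsky's theorem (Stewart and Sun, 1990, Corollary 4.13),

\[ \sum_j (\lambda_j(\hat\Sigma_w) - \lambda_j(\Sigma_w))^2 \leq \|\hat\Sigma_w - \Sigma_w\|_F^2 = \|\Delta_\lambda\|_F^2. \]

Hence,

\[ \mathrm{tr}(M) = \frac{1}{n} \cdot \mathrm{tr}(\hat\Sigma_\lambda^{-1} \Sigma \hat\Sigma_\lambda^{-1} \hat\Sigma) \leq \frac{1}{n} \cdot \frac{1}{(1 - \|\Delta_\lambda\|)^2} \cdot \left( d_{2,\lambda} + \sqrt{d_{2,\lambda}} \cdot \|\Delta_\lambda\|_F \right) \]

which completes the proof. □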
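Lemma 13. If $\|\Delta_\lambda\| < 1$, then

\[ \lambda_{\max}(\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}) \leq \frac{1}{1 - \|\Delta_\lambda\|}. \]

Proof. Observe that $\Sigma_\lambda^{-1/2} \hat\Sigma_\lambda \Sigma_\lambda^{-1/2} = I + \Sigma_\lambda^{-1/2} (\hat\Sigma_\lambda - \Sigma_\lambda) \Sigma_\lambda^{-1/2} = I + \Sigma_\lambda^{-1/2} (\hat\Sigma - \Sigma) \Sigma_\lambda^{-1/2} =
I + \Delta_\lambda$, and that

\[ \lambda_{\min}(I + \Delta_\lambda) \geq 1 - \|\Delta_\lambda\| > 0 \]

by Weyl's theorem (Horn and Johnson, 1985, Theorem 4.3.1) and the assumption $\|\Delta_\lambda\| < 1$. Therefore

\[ \lambda_{\max}(\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}) = \lambda_{\max}\left( (\Sigma_\lambda^{-1/2} \hat\Sigma_\lambda \Sigma_\lambda^{-1/2})^{-1} \right) \leq \frac{1}{1 - \|\Delta_\lambda\|}. \qquad \Box \]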
A Exponential tail inequalities



The following exponential tail inequalities are used in our analysis. These specific inequalities were
chosen in order to satisfy the general conditions set up in Section 2; however, our analysis can
specialize or generalize with the availability of other tail inequalities of these sorts.

The first tail inequality is for positive semidefinite quadratic forms of a subgaussian random vector. 
We provide the proof for completeness. 



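Lemma 14 (Quadratic forms of a subgaussian random vector; Hsu et al., 2011b). Let $A \in \mathbb{R}^{m \times n}$
be a matrix and $\Sigma := A A^\top$. Let $X = (X_1, \ldots, X_n)$ be a random vector such that for some $\sigma \geq 0$,

\[ \mathbb{E}\left[ \exp(\alpha^\top X) \right] \leq \exp\left( \|\alpha\|^2 \sigma^2 / 2 \right) \]

for all $\alpha \in \mathbb{R}^n$, almost surely. For all $\delta \in (0, 1)$,

\[ \Pr\left[ \|AX\|^2 > \sigma^2 \mathrm{tr}(\Sigma) + 2\sigma^2 \sqrt{\mathrm{tr}(\Sigma^2) \log(1/\delta)} + 2\sigma^2 \|\Sigma\| \log(1/\delta) \right] \leq \delta. \]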



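Proof. Let $Z$ be a vector of $m$ independent standard Gaussian random variables (sampled independently
of $X$). For any $\alpha \in \mathbb{R}^m$,

\[ \mathbb{E}\left[ \exp(Z^\top \alpha) \right] = \exp(\|\alpha\|^2 / 2). \]

Thus, for any $\lambda \in \mathbb{R}$ and $\varepsilon \geq 0$,

\[ \mathbb{E}\left[ \exp(\lambda Z^\top A X) \right] \geq \mathbb{E}\left[ \exp(\lambda Z^\top A X) \,\middle|\, \|AX\|^2 > \varepsilon \right] \cdot \Pr\left[ \|AX\|^2 > \varepsilon \right] \geq \exp\left( \frac{\lambda^2 \varepsilon}{2} \right) \cdot \Pr\left[ \|AX\|^2 > \varepsilon \right]. \tag{3} \]

Moreover,

\[ \mathbb{E}\left[ \exp(\lambda Z^\top A X) \right] = \mathbb{E}\left[ \mathbb{E}\left[ \exp(\lambda Z^\top A X) \,\middle|\, Z \right] \right] \leq \mathbb{E}\left[ \exp\left( \frac{\lambda^2 \sigma^2}{2} \|A^\top Z\|^2 \right) \right]. \tag{4} \]

Let $U S V^\top$ be a singular value decomposition of $A$; where $U$ and $V$ are, respectively, matrices of
orthonormal left and right singular vectors; and $S = \mathrm{diag}(\sqrt{\rho_1}, \ldots, \sqrt{\rho_m})$ is the diagonal matrix of
corresponding singular values. Note that

\[ \|\rho\|_1 = \sum_{i=1}^m \rho_i = \mathrm{tr}(\Sigma), \qquad \|\rho\|_2^2 = \sum_{i=1}^m \rho_i^2 = \mathrm{tr}(\Sigma^2), \qquad \text{and} \qquad \|\rho\|_\infty = \max_i \rho_i = \|\Sigma\|. \]

By rotational invariance, $Y := U^\top Z$ is an isotropic multivariate Gaussian random vector with mean
zero. Therefore $\|A^\top Z\|^2 = Z^\top U S^2 U^\top Z = \rho_1 Y_1^2 + \cdots + \rho_m Y_m^2$. Let $\gamma := \lambda^2 \sigma^2 / 2$. By a standard
bound for the moment generating function of linear combinations of $\chi^2$-random variables (e.g.,
Laurent and Massart, 2000),

\[ \mathbb{E}\left[ \exp\left( \gamma \sum_{i=1}^m \rho_i Y_i^2 \right) \right] \leq \exp\left( \|\rho\|_1 \gamma + \frac{\|\rho\|_2^2 \gamma^2}{1 - 2\|\rho\|_\infty \gamma} \right) \tag{5} \]

for $0 \leq \gamma < 1/(2\|\rho\|_\infty)$. Combining (3), (4), and (5) gives

\[ \Pr\left[ \|AX\|^2 > \varepsilon \right] \leq \exp\left( -\varepsilon\gamma/\sigma^2 + \|\rho\|_1 \gamma + \frac{\|\rho\|_2^2 \gamma^2}{1 - 2\|\rho\|_\infty \gamma} \right) \]

for $0 \leq \gamma < 1/(2\|\rho\|_\infty)$ and $\varepsilon \geq 0$. Choosing

\[ \varepsilon := \sigma^2 (\|\rho\|_1 + \tau) \qquad \text{and} \qquad \gamma := \frac{1}{2\|\rho\|_\infty} \left( 1 - \sqrt{\frac{\|\rho\|_2^2}{\|\rho\|_2^2 + 2\|\rho\|_\infty \tau}} \right), \]

we have

\begin{align*}
\Pr\left[ \|AX\|^2 > \sigma^2 (\|\rho\|_1 + \tau) \right]
&\leq \exp\left( -\frac{\|\rho\|_2^2}{2\|\rho\|_\infty^2} \left( 1 + \frac{\|\rho\|_\infty \tau}{\|\rho\|_2^2} - \sqrt{1 + \frac{2\|\rho\|_\infty \tau}{\|\rho\|_2^2}} \right) \right) \\
&= \exp\left( -\frac{\|\rho\|_2^2}{2\|\rho\|_\infty^2} \, h_1\!\left( \frac{\|\rho\|_\infty \tau}{\|\rho\|_2^2} \right) \right)
\end{align*}

where $h_1(a) := 1 + a - \sqrt{1 + 2a}$, which has the inverse function $h_1^{-1}(b) = \sqrt{2b} + b$. The result follows
by setting $\tau := 2\sqrt{\|\rho\|_2^2 t} + 2\|\rho\|_\infty t = 2\sqrt{\mathrm{tr}(\Sigma^2) t} + 2\|\Sigma\| t$ with $t := \log(1/\delta)$. □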
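
The last step of the proof relies on the inverse of $h_1$ and on the choice of $\tau$ making the exponent equal to $t$. The following sketch checks both numerically; the function names and the example vector $\rho$ are hypothetical and serve only as an illustration.

```python
import math


def h1(a):
    """h_1(a) = 1 + a - sqrt(1 + 2a), defined for a >= 0."""
    return 1.0 + a - math.sqrt(1.0 + 2.0 * a)


def h1_inverse(b):
    """Claimed inverse h_1^{-1}(b) = sqrt(2b) + b, for b >= 0."""
    return math.sqrt(2.0 * b) + b


def tail_exponent(rho, t):
    """Exponent (||rho||_2^2 / (2 ||rho||_inf^2)) * h_1(||rho||_inf * tau / ||rho||_2^2)
    evaluated at tau = 2 sqrt(||rho||_2^2 t) + 2 ||rho||_inf t."""
    r2 = sum(r * r for r in rho)
    m = max(rho)
    tau = 2.0 * math.sqrt(r2 * t) + 2.0 * m * t
    return r2 / (2.0 * m * m) * h1(m * tau / r2)
```

For any nonnegative $\rho$ with a positive entry, `tail_exponent(rho, t)` should return $t$ up to rounding, which is exactly the statement that the tail probability is at most $e^{-t}$.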

The next lemma is a general tail inequality for sums of bounded random vectors. We use the
shorthand $a_{1:k}$ to denote the sequence $a_1, \ldots, a_k$, and $a_{1:0}$ is the empty sequence.

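Lemma 15 (Sums of random vectors). Let $X_1, \ldots, X_n$ be a martingale difference vector sequence
(i.e., $\mathbb{E}[X_i | X_{1:i-1}] = 0$ for all $i = 1, \ldots, n$) such that

\[ \sum_{i=1}^n \mathbb{E}\left[ \|X_i\|^2 \,\middle|\, X_{1:i-1} \right] \leq v \qquad \text{and} \qquad \|X_i\| \leq b \]

for all $i = 1, \ldots, n$, almost surely. For all $\delta \in (0, 1)$,

\[ \Pr\left[ \left\| \sum_{i=1}^n X_i \right\| > \sqrt{v} \left( 1 + \sqrt{8 \log(1/\delta)} \right) + (4/3) \, b \log(1/\delta) \right] \leq \delta. \]

The proof of Lemma 15 is a standard application of Bernstein's inequality.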

The last three tail inequalities concern the spectral accuracy of an empirical second moment matrix. 
Lemma 16 (Spectrum of subgaussian empirical covariance matrix; Litvak et al., 2005; also see Hsu et al.,
2011a). Let $X_1, \ldots, X_n$ be random vectors in $\mathbb{R}^d$ such that, for some $\gamma > 0$,



E 



l:t-l 



/ and 



exp ( a Xj 



E 

for all i = 1, . . . , n, almost surely. 
For all T) G (0, 1/2) and 5 G (0, 1), 



Xi 



i-1 



< exp (||a||27/2) for all a G 



Pr 



A, 



(ji E x * x ^j > 1 + ' e ^ n or Amin u S X * X * T ) < 1 - • e ^>" 



< 5 



where 



32(rflog(l + 2/77)+log(2/a)) | 2(dlog(l + 2/7 ? )+log(2/^)) \ 
n n J 



Lemma 17 (Matrix Chernoff bound; Tropp, 2011). Let $X_1, \ldots, X_n$ be random vectors in $\mathbb{R}^d$ such
that, for some $b > 0$,

E [||Xi|| 2 I > 1 and < 6 

/or all i = 1, . . . , n, almost surely. For all 5 G (0, 1), 



Pr 



i£x,xn<i 



i=l 



2b 2 d 

— lo S 7 

n 



< 8. 



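Lemma 18 (Infinite dimensional matrix Bernstein bound; Hsu et al., 2011a). Let $M$ be a random
matrix, and $b > 0$, $\sigma > 0$, and $k > 0$ be such that, almost surely:

\begin{align*}
\mathbb{E}[M] &= 0, \\
\lambda_{\max}(M) &\leq b, \\
\lambda_{\max}(\mathbb{E}[M^2]) &\leq \sigma^2, \\
\mathrm{tr}(\mathbb{E}[M^2]) &\leq \sigma^2 k.
\end{align*}

If $M_1, \ldots, M_n$ are independent copies of $M$, then for any $t > 0$,

\[ \Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^n M_i \right) > \sqrt{\frac{2\sigma^2 t}{n}} + \frac{b t}{3n} \right] \leq k \cdot t (e^t - t - 1)^{-1}. \]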



B Application to fast least squares computations 



B.l Fast least squares computations 

Our main results can be used to analyze certain data pre-processing techniques designed for speeding
up over-complete least squares computations (e.g., Drineas et al., 2010; Rokhlin and Tygert, 2008).

The goal of these randomized methods is to approximately solve the least squares problem

\[ \min_{w \in \mathbb{R}^d} \frac{1}{N} \|Aw - b\|^2 \]

for some large design matrix $A \in \mathbb{R}^{N \times d}$ and vector $b \in \mathbb{R}^N$. In these methods, the columns of $A$ and
the vector $b$ are first subjected to a random rotation (orthogonal linear transformation) $\Theta \in \mathbb{R}^{N \times N}$.
Then, the rows of $[\Theta A, \Theta b] \in \mathbb{R}^{N \times (d+1)}$ are jointly sub-sampled. Finally, the least squares problem
is solved using just the sub-sampled rows.
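
The three steps above (rotate, sub-sample, solve) can be sketched in a few lines. This is a minimal illustration under several assumptions that are not from the paper: it is restricted to $d = 1$ so that the least squares solution has a closed form, it uses a randomized Hadamard rotation applied through a fast Walsh-Hadamard transform, and it samples rows uniformly with replacement. All function names are hypothetical.

```python
import math
import random


def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    h = 1
    while h < len(x):
        for start in range(0, len(x), 2 * h):
            for j in range(start, start + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return x


def rotate(x, signs):
    """Apply Theta = H diag(signs) / sqrt(N) to the vector x."""
    scale = 1.0 / math.sqrt(len(x))
    return [scale * v for v in fwht([s * v for s, v in zip(signs, x)])]


def subsampled_least_squares_1d(a, b, n_rows, seed=0):
    """Rotate (a, b), jointly sub-sample n_rows rows, and solve min_w sum_i (a_i w - b_i)^2."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in a]
    ra, rb = rotate(a, signs), rotate(b, signs)
    rows = [rng.randrange(len(a)) for _ in range(n_rows)]  # uniform, with replacement
    num = sum(ra[i] * rb[i] for i in rows)
    den = sum(ra[i] * ra[i] for i in rows)
    return num / den
```

Because the rotation is orthogonal, it leaves the full objective unchanged; its role is only to spread the mass of each column across rows before sub-sampling. When $b$ is an exact multiple of $a$, every sub-sample recovers that multiple.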

Let $(X, Y) \in \mathbb{R}^d \times \mathbb{R}$ be a random pair distributed uniformly over the rows of $[\Theta A, \Theta b]$. It can be
shown that the bounded statistical leverage condition is satisfied with

\[ \rho_{2,\mathrm{cov}} = O\left( \sqrt{1 + \frac{\log(N/\delta')}{d}} \right) \]

with probability at least $1 - \delta'$ over the choice of the random rotation matrix $\Theta$ under a variety
of standard ensembles (see below). We thus condition on the event that this holds. Now, let $\beta$ be
the solution to the original least squares problem, and let $\hat\beta_{\mathrm{ols}}$ be the solution to the least squares
problem given by a random sub-sample of the rows of $[\Theta A, \Theta b]$. We have, for any $w \in \mathbb{R}^d$,

\[ L(w) = \mathbb{E}[(X^\top w - Y)^2] = \frac{1}{N} \|\Theta A w - \Theta b\|^2 = \frac{1}{N} \|Aw - b\|^2. \]

Moreover, we have that $Y - X^\top \beta = \mathrm{bias}(X)$, so $\mathbb{E}[\mathrm{bias}(X)^2] = L(\beta)$. Therefore, Theorem 2 implies
that if at least

\[ n \geq n_{2,\delta} = O\left( (d + \log(N/\delta')) \cdot \log(d/\delta) \right) \]

rows of $[\Theta A, \Theta b]$ are sub-sampled, then $\hat\beta_{\mathrm{ols}}$ satisfies the approximation error guarantee (with
probability at least $1 - \delta$ over the random sub-sample):

L(/3 ols ) - L(p) = O ^• L 3 H°g(V^ + bwer order ( 1 / n 2) terms> 

It is possible to slightly improve these bounds with more direct arguments. Nevertheless, our 
analysis shows how these specialized results for fast least squares computations can be understood 
in the more general context of random design linear regression. 



B.2 Random rotations and bounding statistical leverage 

The following lemma gives a simple condition on the distribution of the random orthogonal matrix
$\Theta \in \mathbb{R}^{N \times N}$ used to pre-process a data matrix $A$ so that the bounded statistical leverage condition is
applicable to the uniform distribution over the rows of $\Theta A$. Its proof is a straightforward application
of Lemma 14. We also give two simple examples under which the required condition holds.
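Lemma 19. Suppose $\Theta \in \mathbb{R}^{N \times N}$ is a random orthogonal matrix and $\sigma > 0$ is a constant such that
for each $i = 1, \ldots, N$, for all $\alpha \in \mathbb{R}^N$, and almost surely:

\[ \mathbb{E}\left[ \exp\left( \alpha^\top (\sqrt{N} \Theta^\top e_i) \right) \right] \leq \exp\left( \|\alpha\|^2 \sigma^2 / 2 \right). \]

Let $A \in \mathbb{R}^{N \times d}$ be any matrix of rank $d$, and let $\Sigma := (1/N)(\Theta A)^\top (\Theta A) = (1/N) A^\top A$. There exists

\[ \rho_{2,\mathrm{cov}} \leq \sigma \sqrt{1 + 2\sqrt{\frac{\log(N/\delta)}{d}} + \frac{2\log(N/\delta)}{d}} \]

such that

\[ \Pr\left[ \max_{i=1,\ldots,N} \|\Sigma^{-1/2} (\Theta A)^\top e_i\| > \rho_{2,\mathrm{cov}} \sqrt{d} \right] \leq \delta. \]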



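Proof. Let $Z_i := \sqrt{N} \Theta^\top e_i$ for each $i = 1, \ldots, N$. Let $U \in \mathbb{R}^{N \times d}$ be a matrix of left orthonormal
singular vectors of $A$. We have

\[ \|\Sigma^{-1/2} (\Theta A)^\top e_i\| = \|\sqrt{N} U^\top \Theta^\top e_i\| = \|U^\top Z_i\|. \]

By Lemma 14,

\[ \Pr\left[ \|U^\top Z_i\|^2 > \sigma^2 \left( d + 2\sqrt{d \log(N/\delta)} + 2\log(N/\delta) \right) \right] \leq \delta/N. \]

Therefore, by a union bound,

\[ \Pr\left[ \max_{i=1,\ldots,N} \|\Sigma^{-1/2} (\Theta A)^\top e_i\|^2 > \sigma^2 \left( d + 2\sqrt{d \log(N/\delta)} + 2\log(N/\delta) \right) \right] \leq \delta. \]

□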
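
To get a feel for the size of the leverage bound in Lemma 19, the following sketch evaluates its right-hand side. The function name and the example values are hypothetical; the formula is the one stated in the lemma, and the identity checked in the comment is the one used in the proof.

```python
import math


def leverage_bound(sigma, n_rows, d, delta):
    """Upper bound on rho_{2,cov} from Lemma 19.

    By construction, leverage_bound(...)**2 * d equals
    sigma^2 * (d + 2 sqrt(d log(N/delta)) + 2 log(N/delta)).
    """
    t = math.log(n_rows / delta)
    return sigma * math.sqrt(1.0 + 2.0 * math.sqrt(t / d) + 2.0 * t / d)
```

The bound approaches $\sigma$ when $d$ is large relative to $\log(N/\delta)$ and grows only logarithmically in $N$.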



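Example 1. Let $\Theta$ be distributed uniformly over all $N \times N$ orthogonal matrices. Fix any $i =
1, \ldots, N$. The random vector $V := \Theta^\top e_i$ is distributed uniformly on the unit sphere $S^{N-1}$. Let $L$
be a $\chi$ random variable with $N$ degrees of freedom, independent of $V$, so $LV$ has an isotropic multivariate Gaussian
distribution. By Jensen's inequality

\begin{align*}
\mathbb{E}\left[ \exp\left( \alpha^\top (\sqrt{N} \Theta^\top e_i) \right) \right]
&= \mathbb{E}\left[ \exp\left( \alpha^\top (\sqrt{N} V) \right) \right] \\
&= \mathbb{E}\left[ \exp\left( \frac{\sqrt{N}}{\mathbb{E}[L]} \alpha^\top (\mathbb{E}[L] V) \right) \right] \\
&\leq \mathbb{E}\left[ \mathbb{E}\left[ \exp\left( \frac{\sqrt{N}}{\mathbb{E}[L]} \alpha^\top (LV) \right) \,\middle|\, V \right] \right] \\
&= \exp\left( \frac{\|\alpha\|^2 N}{2 \mathbb{E}[L]^2} \right) \\
&\leq \exp\left( \|\alpha\|^2 \left( 1 - \frac{1}{4N} - \frac{1}{360 N^3} \right)^{-2} / 2 \right)
\end{align*}

since

\[ \mathbb{E}[L] \geq \sqrt{N} \left( 1 - \frac{1}{4N} - \frac{1}{360 N^3} \right). \]

Therefore, the condition is satisfied with $\sigma = 1 + O(1/N)$.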

Example 2. Let $N$ be a power of two, and let $\Theta := H \,\mathrm{diag}(S) / \sqrt{N}$, where $H \in \{\pm 1\}^{N \times N}$ is the
$N \times N$ Hadamard matrix, and $S := (S_1, \ldots, S_N) \in \{\pm 1\}^N$ is a vector of $N$ Rademacher variables
(i.e., $S_1, \ldots, S_N$ i.i.d. with $\Pr[S_1 = 1] = \Pr[S_1 = -1] = 1/2$). This random rotation is a key
component of the fast Johnson-Lindenstrauss transform of Ailon and Chazelle (2009), also used
by Drineas et al. (2010). For each $i = 1, \ldots, N$, the distribution of $\sqrt{N} \Theta^\top e_i$ is the same as that of
$S$, and therefore

\[ \mathbb{E}\left[ \exp\left( \alpha^\top (\sqrt{N} \Theta^\top e_i) \right) \right] = \mathbb{E}\left[ \exp(\alpha^\top S) \right] \leq \exp(\|\alpha\|^2 / 2) \]

where the last step follows by Hoeffding's inequality. Therefore, the condition is satisfied with
$\sigma = 1$.
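
The structure used in Example 2 can be checked directly on a small instance. The sketch below is an illustration with hypothetical function names, assuming the Sylvester construction of the Hadamard matrix: it builds $\Theta$ explicitly, so one can verify that its rows are orthonormal and that $\sqrt{N} \Theta^\top e_i$ (the $i$-th row of $\Theta$ scaled by $\sqrt{N}$) has entries $S_j H_{ij} \in \{\pm 1\}$.

```python
import math
import random


def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix; n must be a power of two."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def randomized_hadamard(n, seed=0):
    """Return (theta, signs) with theta = H diag(signs) / sqrt(n) as a list of rows."""
    rng = random.Random(seed)
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    scale = 1.0 / math.sqrt(n)
    theta = [[scale * h_ij * s_j for h_ij, s_j in zip(row, signs)]
             for row in hadamard(n)]
    return theta, signs
```

Since each row of $H$ has entries in $\{\pm 1\}$, multiplying it entrywise by i.i.d. Rademacher signs yields a vector with the same distribution as $S$, which is the fact the example relies on.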